OPM Data Separation Analysis

Christopher Boomhower1, Stacey Fabricant2, Alex Frye1, David Mumford2, Michael Smith1, Lindsay Vitovsky1

1 Southern Methodis Univeristy, Dallas, TX, US 2 Penn Mutual Life Insurance Co, Horsham PA </i></b>

Introduction

background text...

our intent is to: 1)..2)...3)........

In [ ]:
 

Data Understanding

Data Source Background Text & citation links

Dataset Attribute Descriptions

Load the Data

To begin our analysis, we need to load the data from our 89 source .txt files. Data is separated into two separate groups of files; Separation and Non-Separation, thus data is loaded in two separate phases, then unioned together. Once data is loaded, Steps taken to remove non-US observations or those with no specified occupation, no specified salary, or no specified length of service level. Of a total 8,423,336 observations, we end with 8,232,375 after removal of these observations.

In [1]:
## Import libraries
import pickle
import os
import psutil
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import requests
import json
import missingno as msno
import prettytable
import math
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.utils import class_weight
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.ensemble  import RandomForestClassifier
from sklearn.cross_validation import ShuffleSplit
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score
from datetime import datetime
from dateutil.parser import parse
from itertools import cycle
from sklearn import metrics as mt
import itertools


## Library Options

pd.options.mode.chained_assignment = None

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
/usr/local/es7/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [3]:
## Pre-defined Functions for use later
def pickleObject(objectname, filename, filepath = "PickleJar/"):
    fullpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Create a variable to pickle and open it in write mode
    picklefile = open(fullpicklepath, 'wb')
    pickle.dump(objectname, picklefile)
    picklefile.close()
    
def unpickleObject(filename, filepath = "PickleJar/"):
    fullunpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Create an variable to pickle and open it in write mode
    unpicklefile = open(fullunpicklepath, 'rb')
    unpickleObject = pickle.load(unpicklefile)
    unpicklefile.close()

    return unpickleObject
    
def clear_display():
    from IPython.display import clear_output
    
## Pre-defined variables for use later
dataOPMPath = "dataOPM"
dataEMPPath = "dataEMP"
PickleJarPath = "PickleJar"
In [4]:
%%time

## Load OPMSeparation Files

OPMDataFiles = glob.glob(os.path.join(dataOPMPath, "*.txt"))

for i in range(0,len(OPMDataFiles)):
    OPMDataFiles[i] = OPMDataFiles[i].replace("\\","/")

OPMDataList = []

for i,j in zip(OPMDataFiles,range(0,len(OPMDataFiles))):
    OPMDataList.append(pd.read_csv(i, dtype = 'str'))
    display(OPMDataList[j].head())

## Load the SEPDATA_FY2015 file into it's own object
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/SEPDATA_FY2015.txt']
OPMDataOrig = OPMDataList[indexes[0]]
AGELVL AGELVLT
0 A Less than 20
1 B 20-24
2 C 25-29
3 D 30-34
4 E 35-39
AGYTYP AGYTYPT AGY AGYT AGYSUB AGYSUBT
0 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF** AF**-INVALID
1 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF02 AF02-AIR FORCE INSPECTION AGENCY (FO)
2 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF03 AF03-AIR FORCE OPERATIONAL TEST AND EVALUATION...
3 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF06 AF06-AIR FORCE AUDIT AGENCY
4 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF07 AF07-AIR FORCE OFFICE OF SPECIAL INVESTIGATIONS
QTR QTRT EFDATE EFDATET
0 1 OCT-DEC 2014 201410 OCT 2014
1 1 OCT-DEC 2014 201411 NOV 2014
2 1 OCT-DEC 2014 201412 DEC 2014
3 2 JAN-MAR 2015 201501 JAN 2015
4 2 JAN-MAR 2015 201502 FEB 2015
GENDER GENDERT
0 F Female
1 M Male
2 Z Unspecified
GSEGRD
0 **
1 01
2 02
3 03
4 04
LOCTYP LOCTYPT LOC LOCT
0 1 United States 01 01-ALABAMA
1 1 United States 02 02-ALASKA
2 1 United States 04 04-ARIZONA
3 1 United States 05 05-ARKANSAS
4 1 United States 06 06-CALIFORNIA
LOSLVL LOSLVLT
0 A Less than 1 year
1 B 1 - 2 years
2 C 3 - 4 years
3 D 5 - 9 years
4 E 10 - 14 years
OCCTYP OCCTYPT OCCFAM OCCFAMT OCC OCCT
0 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0006 0006-CORRECTIONAL INSTITUTION ADMINISTRATION
1 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0007 0007-CORRECTIONAL OFFICER
2 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0017 0017-EXPLOSIVES SAFETY
3 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0018 0018-SAFETY AND OCCUPATIONAL HEALTH MANAGEMENT
4 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0019 0019-SAFETY TECHNICIAN
PATCO PATCOT
0 1 Professional
1 2 Administrative
2 3 Technical
3 4 Clerical
4 5 Other White Collar
PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT PPGRD
0 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-03
1 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-04
2 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-05
3 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-06
4 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-07
SALLVL SALLVLT
0 A Less than $20,000
1 B $20,000 - $29,999
2 C $30,000 - $39,999
3 D $40,000 - $49,999
4 E $50,000 - $59,999
SEP SEPT
0 SA Transfer Out - Individual Transfer
1 SB Transfer Out - Mass Transfer
2 SC Quit
3 SD Retirement - Voluntary
4 SE Retirement - Early Out
TOATYP TOATYPT TOA TOAT
0 1 Permanent 10 10-Competitive Service - Career
1 1 Permanent 15 15-Competitive Service - Career-Conditional
2 1 Permanent 30 30-Excepted Service - Schedule A
3 1 Permanent 32 32-Excepted Service - Schedule B
4 1 Permanent 34 34-Excepted Service - Schedule C
WSTYP WSTYPT WORKSCH WORKSCHT
0 1 Full-time B B-Full-time Nonseasonal Baylor Plan
1 1 Full-time F F-Full-time Nonseasonal
2 1 Full-time G G-Full-time Seasonal
3 1 Full-time H H-Full-time On-call
4 2 Not Full-time I I-Intermittent Nonseasonal
AGYSUB SEP EFDATE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS
0 AA00 SC 201507 C M 11 A 11 0905 1 GS-11 F 40 F 1 063722 00.8
1 AA00 SD 201509 K M NaN D 11 0301 2 EX-02 Z 46 F 1 NaN 08.1
2 AA00 SC 201506 D F 15 C 11 0905 1 GS-15 L 30 F 1 126245 04.8
3 AF** SA 201503 H M 11 C 48 2210 2 GS-11 F 10 F 1 066585 04.9
4 AF02 SD 201506 I M 15 J 35 0301 2 GS-15 O 10 F 1 156737 39.8
CPU times: user 442 ms, sys: 47.3 ms, total: 489 ms
Wall time: 486 ms
In [5]:
%%time

#print(OPMDataFiles)

print(len(OPMDataOrig))

##### Merge / Modify Codes / Aggregate Attributes to be more descriptive per the metadata files

OPMDataMerged = OPMDataOrig.copy()

##AGYSUB - AGYTYP, AGY
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTagy.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'AGYSUB', how = 'left')

##EFDate - quarter, month
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTefdate.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'EFDATE', how = 'left')

##AGELVL - AGELVLT
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTagelvl.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'AGELVL', how = 'left')

##LOSLVL - LOSLVLT
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTloslvl.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'LOSLVL', how = 'left')

##LOC - LocTypeT, LocT
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTloc.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'LOC', how = 'left')

##OCC - OCCTYPT, OCCFAM
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTocc.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'OCC', how = 'left')

##PATCO - PATCOT
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTpatco.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'PATCO', how = 'left')

##PPGRD - PayPlan, PPGroup, PPTYP
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTppgrd.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'PPGRD', how = 'left')

##SALLVL - SALLVLT
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTsallvl.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'SALLVL', how = 'left')

##TOA - TOATYP
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTtoa.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'TOA', how = 'left')

##WORKSCH - WSTYPT
indexes = [i for i,x in enumerate(OPMDataFiles) if x == 'dataOPM/DTwrksch.txt']
OPMDataMerged = OPMDataMerged.merge(OPMDataList[indexes[0]], on = 'WORKSCH', how = 'left')


## Modify Data Types for numeric objects
OPMDataMerged["SALARY"] = OPMDataMerged["SALARY"].apply(pd.to_numeric)
OPMDataMerged["COUNT"]  = OPMDataMerged["COUNT"].apply(pd.to_numeric)
OPMDataMerged["LOS"]    = OPMDataMerged["LOS"].apply(pd.to_numeric)

print("Original SEP data size of: "+str(len(OPMDataMerged)))
print("Removing "+str(len(OPMDataMerged[OPMDataMerged["LOCTYP"] != "1"]))+" Non-US observations.")

    ## Remove Non-US Data
OPMDataMerged = OPMDataMerged[OPMDataMerged["LOCTYP"] == "1"]

print("Removing "+str(len(OPMDataMerged[OPMDataMerged["OCCTYP"] == "3"]))+" observations with no specified Occupation.")

   ## Remove Observations with no specified occupation
OPMDataMerged = OPMDataMerged[OPMDataMerged["OCCTYP"] != "3"]

print("Removing "+str(len(OPMDataMerged[OPMDataMerged["SALLVL"] == "Z"]))+" observations with no specified Salary.")

   ## Remove Observations with no specified salary
OPMDataMerged = OPMDataMerged[OPMDataMerged["SALLVL"] != "Z"]

print("Removing "+str(len(OPMDataMerged[OPMDataMerged["LOSLVL"] == "Z"]))+" observations with no specified Length of Service.")

   ## Remove Observations with no specified LOSLVL
OPMDataMerged = OPMDataMerged[OPMDataMerged["LOSLVL"] != "Z"]

print("Removing "+str(len(OPMDataMerged[OPMDataMerged["AGELVL"] == "A"]))+" observations of Age Level A")

## Remove Observations from Age Level A (less than 20 years old)
OPMDataMerged = OPMDataMerged[OPMDataMerged["AGELVL"] != "A"]

print("Removing "+str(len(OPMDataMerged[OPMDataMerged["AGELVL"] == "Z"]))+" observations with no specified Age Level.")

   ## Remove Observations with no specified Age Level
OPMDataMerged = OPMDataMerged[OPMDataMerged["AGELVL"] != "Z"]

    ## Fix differences in spaces on WORKSCHT Column
OPMDataMerged["WORKSCHT"] = np.where(OPMDataMerged["WORKSCHT"].str[0]=="F", 'Full-time Nonseasonal',
                                np.where(OPMDataMerged["WORKSCHT"].str[0]=="I", 'Intermittent Nonseasonal',
                                         np.where(OPMDataMerged["WORKSCHT"].str[0]=="P", 'Part-time Nonseasonal',
                                                  np.where(OPMDataMerged["WORKSCHT"].str[0]=="G", 'Full-time Seasonal',
                                                        np.where(OPMDataMerged["WORKSCHT"].str[0]=="J", 'Intermittent Seasonal',
                                                                np.where(OPMDataMerged["WORKSCHT"].str[0]=="Q", 'Part-time Seasonal',
                                                                        np.where(OPMDataMerged["WORKSCHT"].str[0]=="T", 'Part-time Job Sharer Seasonal',
                                                                                np.where(OPMDataMerged["WORKSCHT"].str[0]=="S", 'Part-time Job Sharer Nonseasonal',
                                                                                        np.where(OPMDataMerged["WORKSCHT"].str[0]=="B", 'Full-time Nonseasonal Baylor Plan',
                                                                                                'NO WORK SCHEDULE REPORTED' ### ELSE case represents Night
                                                                                                 )
                                                                                         )
                                                                                 )
                                                                         )
                                                                 )
                                                          )
                                                 )
                                        )
                               )    

display(OPMDataMerged.head())
print("New SEP data size of: "+str(len(OPMDataMerged)))
display(OPMDataMerged.describe().transpose())
#del OPMDataList,OPMDataFiles
226357
Original SEP data size of: 226357
Removing 8021 Non-US observations.
Removing 55 observations with no specified Occupation.
Removing 1426 observations with no specified Salary.
Removing 3 observations with no specified Length of Service.
Removing 2570 observations of Age Level A
Removing 0 observations with no specified Age Level.
AGYSUB SEP EFDATE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR QTRT EFDATET AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT
0 AA00 SC 201507 C M 11 A 11 0905 1 GS-11 F 40 F 1 63722.0 0.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 4 JUL-SEP 2015 JUL 2015 25-29 Less than 1 year 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 2 Non-permanent 40-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal
2 AA00 SC 201506 D F 15 C 11 0905 1 GS-15 L 30 F 1 126245.0 4.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 3 APR-JUN 2015 JUN 2015 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $120,000 - $129,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal
3 AF** SA 201503 H M 11 C 48 2210 2 GS-11 F 10 F 1 66585.0 4.9 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF**-INVALID 2 JAN-MAR 2015 MAR 2015 50-54 3 - 4 years 1 United States 48-TEXAS 1 White Collar 22 22xx-INFORMATION TECHNOLOGY 2210-INFORMATION TECHNOLOGY MANAGEMENT Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal
4 AF02 SD 201506 I M 15 J 35 0301 2 GS-15 O 10 F 1 156737.0 39.8 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF02-AIR FORCE INSPECTION AGENCY (FO) 3 APR-JUN 2015 JUN 2015 55-59 35 years or more 1 United States 35-NEW MEXICO 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $150,000 - $159,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal
5 AF03 SC 201509 H M 13 B 06 0301 2 GS-13 I 15 F 1 92973.0 1.0 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF03-AIR FORCE OPERATIONAL TEST AND EVALUATION... 4 JUL-SEP 2015 SEP 2015 50-54 1 - 2 years 1 United States 06-CALIFORNIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $90,000 - $99,999 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal
New SEP data size of: 214282
count mean std min 25% 50% 75% max
COUNT 214282.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
SALARY 214282.0 66479.453855 39471.623281 3913.0 35830.0 54424.0 86910.0 393699.0
LOS 214282.0 11.708865 12.631714 0.0 1.3 6.2 20.4 71.5
CPU times: user 13.9 s, sys: 213 ms, total: 14.1 s
Wall time: 14.1 s
In [6]:
%%time

if os.path.isfile(PickleJarPath+"/EMPDataOrig4Q.pkl"):
    print("Found the File! Loading Pickle Now!")
    EMPDataOrig4Q = unpickleObject("EMPDataOrig4Q")
else:
    ## Load EMPData Files

    indexes = []
    EMPDataFiles = []
    EMPDataList = []
    EMPDataOrig = []

    for i,qtr in enumerate(["Q1", "Q2", "Q3", "Q4"]): 
        EMPDataFiles.append(glob.glob(os.path.join(dataEMPPath, qtr + "/*.txt")))

        for j in range(0,len(EMPDataFiles[i])):
            EMPDataFiles[i][j] = EMPDataFiles[i][j].replace("\\","/")

        EMPDataList.append([])

        for j,file in enumerate(EMPDataFiles[i]):
            EMPDataList[i].append(pd.read_csv(file, dtype = 'str'))
            if i == 0:
                display(EMPDataList[i][j].head())

        ## Load the FactData files into it's own object
        indexes.append([])
            ##[qtr][fileindex from EMPDataList]
        indexes[i]=[j for j,x in enumerate(EMPDataFiles[i]) if dataEMPPath + '/' + qtr + '/FACTDATA' in x]   

        EMPDataOrig.append([])

        EMPDataOrig[i] = pd.concat([EMPDataList[i][indexes[i][j]] for j in range(0,len(indexes[i]))]) 
        EMPDataOrig[i]["QTR"] = str(i+1)

            ## modify data type for numerics
        EMPDataOrig[i]["SALARY"] = EMPDataOrig[i]["SALARY"].str.replace(',', '').str.replace('$', '').str.replace(' ', '').apply(pd.to_numeric)
      
        ## Load Metadata
        ##AGYSUB - AGYTYP, AGY
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTagy.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'AGYSUB', how = 'left')

        ##AGELVL - AGELVLT
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTagelvl.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'AGELVL', how = 'left')

        #LOSLVL - LOSLVLT
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTloslvl.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'LOSLVL', how = 'left')
        EMPDataOrig[i]["LOS"] = EMPDataOrig[i]["LOS"].apply(pd.to_numeric)
        
        ##LOC - LocTypeT, LocT
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTloc.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'LOC', how = 'left')
 
        ##OCC - OCCTYPT, OCCFAM
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTocc.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'OCC', how = 'left')

        ##PATCO - PATCOT
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTpatco.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'PATCO', how = 'left')

        ##PPGRD - PayPlan, PPGroup, PPTYP
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTppgrd.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'PPGRD', how = 'left')

        ##SALLVL - SALLVLT
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTsallvl.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'SALLVL', how = 'left')

        ##TOA - TOATYP
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTtoa.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'TOA', how = 'left')

        ##WORKSCH - WSTYPT
        ind2 = [i for i,x in enumerate(EMPDataFiles[i]) if x == dataEMPPath + '/' + qtr + '/DTwrksch.txt']
        EMPDataOrig[i] = EMPDataOrig[i].merge(EMPDataList[i][ind2[0]], on = 'WORKSCH', how = 'left')

        display(EMPDataOrig[i].head())

    EMPDataOrig4Q = pd.concat([EMPDataOrig[j] for j in range(0,len(EMPDataOrig))])
    print("Original EMP data size of: "+str(len(EMPDataOrig4Q)))
    print("Removing "+str(len(EMPDataOrig4Q[EMPDataOrig4Q["LOCTYP"] != "1"]))+" Non-US observations.")
    
       ## Remove Non-US Data
    EMPDataOrig4Q = EMPDataOrig4Q[EMPDataOrig4Q["LOCTYP"] == "1"]

    print("Removing "+str(len(EMPDataOrig4Q[EMPDataOrig4Q["OCCTYP"] == "3"]))+" observations with no specified Occupation.")

       ## Remove Observations with no specified occupation
    EMPDataOrig4Q = EMPDataOrig4Q[EMPDataOrig4Q["OCCTYP"] != "3"]

    print("Removing "+str(len(EMPDataOrig4Q[EMPDataOrig4Q["SALLVL"] == "Z"]))+" observations with no specified Salary.")

       ## Remove Observations with no specified salary
    EMPDataOrig4Q = EMPDataOrig4Q[EMPDataOrig4Q["SALLVL"] != "Z"]

    print("Removing "+str(len(EMPDataOrig4Q[EMPDataOrig4Q["LOSLVL"] == "Z"]))+" observations with no specified Length of Service.")

       ## Remove Observations with no specified LOSLVL
    EMPDataOrig4Q = EMPDataOrig4Q[EMPDataOrig4Q["LOSLVL"] != "Z"]

    print("Removing "+str(len(EMPDataOrig4Q[EMPDataOrig4Q["AGELVL"] == "A"]))+" observations of Age Level A.")

        ## Remove Observations from Age Level A (less than 20 years old)
    EMPDataOrig4Q = EMPDataOrig4Q[EMPDataOrig4Q["AGELVL"] != "A"]

    print("Removing "+str(len(EMPDataOrig4Q[EMPDataOrig4Q["AGELVL"] == "Z"]))+" observations with no specified Age Level.")

        ## Remove Observations with no specified Age Level
    EMPDataOrig4Q = EMPDataOrig4Q[EMPDataOrig4Q["AGELVL"] != "Z"]

        ## Fix differences in spaces on WORKSCHT Column
    EMPDataOrig4Q["WORKSCHT"] = np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="F", 'Full-time Nonseasonal',
                                    np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="I", 'Intermittent Nonseasonal',
                                             np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="P", 'Part-time Nonseasonal',
                                                      np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="G", 'Full-time Seasonal',
                                                            np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="J", 'Intermittent Seasonal',
                                                                    np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="Q", 'Part-time Seasonal',
                                                                            np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="T", 'Part-time Job Sharer Seasonal',
                                                                                    np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="S", 'Part-time Job Sharer Nonseasonal',
                                                                                            np.where(EMPDataOrig4Q["WORKSCHT"].str[0]=="B", 'Full-time Nonseasonal Baylor Plan',
                                                                                                    'NO WORK SCHEDULE REPORTED' ### ELSE case represents Night
                                                                                                     )
                                                                                             )
                                                                                     )
                                                                             )
                                                                     )
                                                              )
                                                     )
                                            )
                                   )    

    pickleObject(EMPDataOrig4Q, "EMPDataOrig4Q")

print("New EMP data size of: "+str(len(EMPDataOrig4Q)))
AGELVL AGELVLT
0 A Less than 20
1 B 20-24
2 C 25-29
3 D 30-34
4 E 35-39
AGYTYP AGYTYPT AGY AGYT AGYSUB AGYSUBT
0 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF02 AF02-AIR FORCE INSPECTION AGENCY (FO)
1 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF03 AF03-AIR FORCE OPERATIONAL TEST AND EVALUATION...
2 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF05 AF05-AIR FORCE INTELLIGENCE ANALYSIS AGENCY
3 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF06 AF06-AIR FORCE AUDIT AGENCY
4 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF07 AF07-AIR FORCE OFFICE OF SPECIAL INVESTIGATIONS
DATECODE DATECODET
0 201412 DEC 2014
EDLVLTYP EDLVLTYPT EDLVL EDLVLT
0 1 BELOW HIGH SCHOOL 01 01-NO FORMAL EDUCATION OR SOME ELEMENTARY SCHO...
1 1 BELOW HIGH SCHOOL 02 02-ELEMENTARY SCHOOL COMPLETED - NO HIGH SCHOOL
2 1 BELOW HIGH SCHOOL 03 03-SOME HIGH SCHOOL - DID NOT COMPLETE
3 2 HIGH SCHOOL OR EQUIVALENCY 04 04-HIGH SCHOOL GRADUATE OR CERTIFICATE OF EQUI...
4 3 OCCUPATIONAL PROGRAM 05 05-TERMINAL OCCUPATIONAL PROGRAM - DID NOT COM...
GSEGRD
0 **
1 01
2 02
3 03
4 04
LOCTYP LOCTYPT LOC LOCT
0 1 United States 01 01-ALABAMA
1 1 United States 02 02-ALASKA
2 1 United States 04 04-ARIZONA
3 1 United States 05 05-ARKANSAS
4 1 United States 06 06-CALIFORNIA
LOSLVL LOSLVLT
0 A Less than 1 year
1 B 1 - 2 years
2 C 3 - 4 years
3 D 5 - 9 years
4 E 10 - 14 years
OCCTYP OCCTYPT OCCFAM OCCFAMT OCC OCCT
0 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0006 0006-CORRECTIONAL INSTITUTION ADMINISTRATION
1 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0007 0007-CORRECTIONAL OFFICER
2 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0017 0017-EXPLOSIVES SAFETY
3 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0018 0018-SAFETY AND OCCUPATIONAL HEALTH MANAGEMENT
4 1 White Collar 00 00xx-MISCELLANEOUS OCCUPATIONS 0019 0019-SAFETY TECHNICIAN
PATCO PATCOT
0 1 Professional
1 2 Administrative
2 3 Technical
3 4 Clerical
4 5 Other White Collar
PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT PPGRD
0 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-03
1 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-04
2 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-05
3 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-06
4 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GL GL-GS EMPLOYEES IN GRADES 3 THROUGH 10 PAID A ... GL-07
SALLVL SALLVLT
0 A Less than $20,000
1 B $20,000 - $29,999
2 C $30,000 - $39,999
3 D $40,000 - $49,999
4 E $50,000 - $59,999
STEMAGG STEMAGGT STEMTYP STEMTYPT STEMOCC STEMOCCT
0 1 STEM OCCUPATIONS 01 SCIENCE OCCUPATIONS 0401 0401-GENERAL NATURAL RESOURCES MANAGEMENT AND ...
1 1 STEM OCCUPATIONS 01 SCIENCE OCCUPATIONS 0403 0403-MICROBIOLOGY
2 1 STEM OCCUPATIONS 01 SCIENCE OCCUPATIONS 0405 0405-PHARMACOLOGY
3 1 STEM OCCUPATIONS 01 SCIENCE OCCUPATIONS 0408 0408-ECOLOGY
4 1 STEM OCCUPATIONS 01 SCIENCE OCCUPATIONS 0410 0410-ZOOLOGY
SUPERTYP SUPERTYPT SUPERVIS SUPERVIST
0 1 Supervisor 2 2-SUPERVISOR OR MANAGER
1 2 Leader 6 6-LEADER
2 2 Leader 7 7-TEAM LEADER
3 3 Non-Supervisor 4 4-SUPERVISOR (CSRA)
4 3 Non-Supervisor 5 5-MANAGEMENT OFFICIAL (CSRA)
TOATYP TOATYPT TOA TOAT
0 1 Permanent 10 10-Competitive Service - Career
1 1 Permanent 15 15-Competitive Service - Career-Conditional
2 1 Permanent 30 30-Excepted Service - Schedule A
3 1 Permanent 32 32-Excepted Service - Schedule B
4 1 Permanent 34 34-Excepted Service - Schedule C
WORKSTAT WORKSTATT
0 1 Non-Seasonal Full Time Permanent
1 2 Other Employees
WSTYP WSTYPT WORKSCH WORKSCHT
0 1 Full-time B B - Full-time Nonseasonal Baylor Pln
1 1 Full-time F F - Full-time Nonseasonal
2 1 Full-time G G - Full-time Seasonal
3 1 Full-time H H - Full-time On-call
4 2 Not Full-time I I - Intermittent Nonseasonal
AGYSUB LOC AGELVL EDLVL GSEGRD LOSLVL OCC PATCO PPGRD SALLVL STEMOCC SUPERVIS TOA WORKSCH WORKSTAT DATECODE EMPLOYMENT SALARY LOS
0 AA00 11 C 04 09 B 0301 2 GS-09 E XXXX 8 44 F 2 201412 1 $52,146 1.3
1 AA00 11 D 15 12 B 0905 1 GS-12 G XXXX 8 30 F 1 201412 1 $75,621 2.3
2 AA00 11 G 15 NaN D 0301 2 ES-** P XXXX 2 50 F 1 201412 1 $165,000 5.2
3 AA00 11 D 15 14 B 0905 1 GS-14 J XXXX 8 30 F 1 201412 1 $106,263 2.7
4 AA00 11 D 13 15 E 0341 2 GS-15 N XXXX 8 10 F 1 201412 1 $141,660 11.5
AGYSUB LOC AGELVL EDLVL GSEGRD LOSLVL OCC PATCO PPGRD SALLVL STEMOCC SUPERVIS TOA WORKSCH WORKSTAT DATECODE EMPLOYMENT SALARY LOS
0 HE37 04 E 07 NaN A 7404 6 WG-05 C XXXX 8 30 F 1 201412 1 $33,392 0.1
1 HE37 38 I 13 09 E 0610 1 GS-09 F XXXX 8 10 F 1 201412 1 $64,363 10.3
2 HE37 40 H 04 04 E 0382 4 GS-04 C XXXX 8 10 F 1 201412 1 $34,862 14.1
3 HE37 38 J 10 08 F 0649 3 GS-08 E XXXX 8 10 F 1 201412 1 $57,012 17.9
4 HE37 35 J 07 06 J 0661 3 GS-06 D XXXX 8 10 F 1 201412 1 $45,828 38.8
AGYSUB LOC AGELVL EDLVL GSEGRD LOSLVL OCC PATCO PPGRD SALLVL STEMOCC SUPERVIS TOA WORKSCH WORKSTAT DATECODE EMPLOYMENT SALARY LOS QTR AGYTYP AGYTYPT AGY AGYT AGYSUBT AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT
0 AA00 11 C 04 09 B 0301 2 GS-09 E XXXX 8 44 F 2 201412 1 52146.0 1.3 1 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 25-29 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $50,000 - $59,999 2 Non-permanent 44-Excepted Service - Schedule C 1 Full-time F - Full-time Nonseasonal
1 AA00 11 D 15 12 B 0905 1 GS-12 G XXXX 8 30 F 1 201412 1 75621.0 2.3 1 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $70,000 - $79,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
2 AA00 11 G 15 NaN D 0301 2 ES-** P XXXX 2 50 F 1 201412 1 165000.0 5.2 1 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 45-49 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans ES ES-SENIOR EXECUTIVE SERVICE $160,000 - $169,999 1 Permanent 50-Senior Executive Service - Career 1 Full-time F - Full-time Nonseasonal
3 AA00 11 D 15 14 B 0905 1 GS-14 J XXXX 8 30 F 1 201412 1 106263.0 2.7 1 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $100,000 - $109,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
4 AA00 11 D 13 15 E 0341 2 GS-15 N XXXX 8 10 F 1 201412 1 141660.0 11.5 1 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 10 - 14 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0341-ADMINISTRATIVE OFFICER Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $140,000 - $149,999 1 Permanent 10-Competitive Service - Career 1 Full-time F - Full-time Nonseasonal
AGYSUB LOC AGELVL EDLVL GSEGRD LOSLVL OCC PATCO PPGRD SALLVL STEMOCC SUPERVIS TOA WORKSCH WORKSTAT DATECODE EMPLOYMENT SALARY LOS QTR AGYTYP AGYTYPT AGY AGYT AGYSUBT AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT
0 AA00 11 D 15 14 C 0905 1 GS-14 J XXXX 8 30 F 1 201503 1 107325.0 4.4 2 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $100,000 - $109,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
1 AA00 11 H 15 NaN G 0905 1 ES-** P XXXX 2 50 F 1 201503 1 165000.0 22.2 2 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 50-54 20 - 24 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans ES ES-SENIOR EXECUTIVE SERVICE $160,000 - $169,999 1 Permanent 50-Senior Executive Service - Career 1 Full-time F - Full-time Nonseasonal
2 AA00 11 K 21 15 J 0905 1 GS-15 O XXXX 8 30 F 1 201503 1 158700.0 40.4 2 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 65 or more 35 years or more 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $150,000 - $159,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
3 AA00 11 C 13 09 D 0301 2 GS-09 E XXXX 8 10 F 1 201503 1 54423.0 7.3 2 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 25-29 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $50,000 - $59,999 1 Permanent 10-Competitive Service - Career 1 Full-time F - Full-time Nonseasonal
4 AA00 11 D 15 15 C 0905 1 GS-15 L XXXX 8 30 F 1 201503 1 126245.0 4.5 2 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $120,000 - $129,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
AGYSUB LOC AGELVL EDLVL GSEGRD LOSLVL OCC PATCO PPGRD SALLVL STEMOCC SUPERVIS TOA WORKSCH WORKSTAT DATECODE EMPLOYMENT SALARY LOS QTR AGYTYP AGYTYPT AGY AGYT AGYSUBT AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT
0 AA00 11 D 15 14 C 0905 1 GS-14 K XXXX 8 30 F 1 201506 1 110902.0 3.2 3 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $110,000 - $119,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
1 AA00 11 C 13 09 D 0301 2 GS-09 E XXXX 8 10 F 1 201506 1 54423.0 7.5 3 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 25-29 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $50,000 - $59,999 1 Permanent 10-Competitive Service - Career 1 Full-time F - Full-time Nonseasonal
2 AA00 11 E 13 15 E 0341 2 GS-15 N XXXX 8 10 F 1 201506 1 143079.0 12.0 3 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 35-39 10 - 14 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0341-ADMINISTRATIVE OFFICER Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $140,000 - $149,999 1 Permanent 10-Competitive Service - Career 1 Full-time F - Full-time Nonseasonal
3 AA00 11 D 15 15 D 0905 1 GS-15 M XXXX 8 30 F 1 201506 1 130453.0 5.8 3 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $130,000 - $139,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
4 AA00 11 D 15 13 B 0905 1 GS-13 I XXXX 8 30 F 1 201506 1 90823.0 2.8 3 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $90,000 - $99,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
AGYSUB LOC AGELVL EDLVL GSEGRD LOSLVL OCC PATCO PPGRD SALLVL STEMOCC SUPERVIS TOA WORKSCH WORKSTAT DATECODE EMPLOYMENT SALARY LOS QTR AGYTYP AGYTYPT AGY AGYT AGYSUBT AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT
0 AA00 11 C 04 09 B 0301 2 GS-09 E XXXX 8 44 F 2 201509 1 52668.0 2.1 4 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 25-29 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $50,000 - $59,999 2 Non-permanent 44-Excepted Service - Schedule C 1 Full-time F - Full-time Nonseasonal
1 AA00 11 C 15 09 A 0904 1 GS-09 E XXXX 8 40 F 2 201509 1 52668.0 0.0 4 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 25-29 Less than 1 year 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0904-LAW CLERK Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $50,000 - $59,999 2 Non-permanent 40-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
2 AA00 11 D 15 15 D 0905 1 GS-15 M XXXX 8 30 F 1 201509 1 130453.0 6.0 4 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 30-34 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $130,000 - $139,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
3 AA00 11 K 21 15 J 0905 1 GS-15 O XXXX 8 30 F 1 201509 1 158700.0 40.9 4 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 65 or more 35 years or more 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $150,000 - $159,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
4 AA00 11 E 04 14 D 0905 1 GS-14 K XXXX 8 30 F 1 201509 1 118057.0 8.1 4 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 35-39 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $110,000 - $119,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time F - Full-time Nonseasonal
Original EMP data size of: 8196979
Removing 165501 Non-US observations.
Removing 710 observations with no specified Occupation.
Removing 13898 observations with no specified Salary.
Removing 21 observations with no specified Length of Service.
Removing 7937 observations of Age Level A.
Removing 1 observations with no specified Age Level.
New EMP data size of: 8008911
CPU times: user 6min 30s, sys: 17.2 s, total: 6min 47s
Wall time: 7min 36s
In [7]:
display(EMPDataOrig4Q.describe().transpose())
count mean std min 25% 50% 75% max
SALARY 8008911.0 80067.37279 37918.758366 15120.0 51437.0 74130.0 99957.0 401589.0
LOS 8008911.0 13.06029 10.446755 0.0 4.9 10.0 20.1 71.1
In [8]:
%matplotlib inline

#sns.boxplot(y = "SALARY", data = EMPDataOrig4Q)

With both our separation and non-separation data loaded, we calculate three new attributes through aggregation or calculation amongst various attributes.

1) SEP Count by Date & Occupation – total number of separations (of any type) for a given Date and Occupation;

2) SEP Count by Date & Location – total number of separations (of any type) for a given Date and Location;

3) Industry Average Salary – Average salary amongst non-separated employees, grouped by quarter, occupation, pay grade, and work schedule;

We proceed, by concatenating our Separation and Non-Separation observations, and merge these newly calculated attributes to the concatenated dataset.

In [9]:
%%time
%matplotlib inline

##Aggregate Number of Total Separations in current month for given Occ
AggSEPCount_EFDATE_OCC= pd.DataFrame({'SEPCount_EFDATE_OCC' : OPMDataMerged.groupby(["EFDATE", "OCC"]).size()}).reset_index()
display(AggSEPCount_EFDATE_OCC.head())


##Aggregate Number of Total Separations in current month for given LOC
AggSEPCount_EFDATE_LOC = pd.DataFrame({'SEPCount_EFDATE_LOC' : OPMDataMerged.groupby(["EFDATE", "LOC"]).size()}).reset_index()
display(AggSEPCount_EFDATE_LOC.head())

##Average Quarterly EMP Salary by occ 
AggIndAvgSalary = pd.DataFrame({'count' : EMPDataOrig4Q.groupby(["QTR", "OCC", "PPGRD", "WORKSCHT"]).size()}).reset_index()
AggIndAvgSalary2 = pd.DataFrame({'IndSalarySum' : EMPDataOrig4Q.groupby(["QTR", "OCC", "PPGRD", "WORKSCHT"])["SALARY"].sum()}).reset_index()
AggIndAvgSalary = AggIndAvgSalary.merge(AggIndAvgSalary2,on=["QTR", "OCC", "PPGRD", "WORKSCHT"])
AggIndAvgSalary["IndAvgSalary"] = AggIndAvgSalary["IndSalarySum"]/AggIndAvgSalary["count"]
del AggIndAvgSalary["count"]
del AggIndAvgSalary["IndSalarySum"]
display(AggIndAvgSalary.head())
EFDATE OCC SEPCount_EFDATE_OCC
0 201410 0006 20
1 201410 0007 89
2 201410 0017 1
3 201410 0018 33
4 201410 0019 1
EFDATE LOC SEPCount_EFDATE_LOC
0 201410 01 239
1 201410 02 261
2 201410 04 499
3 201410 05 132
4 201410 06 1926
QTR OCC PPGRD WORKSCHT IndAvgSalary
0 1 0006 ES-** Full-time Nonseasonal 161827.273973
1 1 0006 GL-09 Full-time Nonseasonal 63970.126984
2 1 0006 GS-09 Full-time Nonseasonal 56876.500000
3 1 0006 GS-11 Full-time Nonseasonal 72865.783673
4 1 0006 GS-12 Full-time Nonseasonal 85742.663717
CPU times: user 3.08 s, sys: 116 ms, total: 3.2 s
Wall time: 3.19 s
In [10]:
#Merge Two Datasets
### NS SEP code means NonSeparation
###add hardcoded null value columns where applicable
EMPDataOrig4Q["SEP"] = "NS"
EMPDataOrig4Q["GENDER"] = np.nan
EMPDataOrig4Q["COUNT"] = np.nan

OPMDataMerged["DATECODE"] = OPMDataMerged["EFDATE"]

OPMColList = ["AGYSUB", "SEP", "DATECODE",   "AGELVL", "GENDER", "GSEGRD", "LOSLVL", "LOC", "OCC", "PATCO", "PPGRD", "SALLVL", "TOA", "WORKSCH", "COUNT", "SALARY", "LOS", "AGYTYP", "AGYTYPT", "AGY", "AGYT", "AGYSUBT", "QTR", "AGELVLT", "LOSLVLT", "LOCTYP", "LOCTYPT", "LOCT", "OCCTYP", "OCCTYPT", "OCCFAM", "OCCFAMT", "OCCT", "PATCOT", "PPTYP", "PPTYPT", "PPGROUP", "PPGROUPT", "PAYPLAN", "PAYPLANT", "SALLVLT", "TOATYP", "TOATYPT", "TOAT", "WSTYP", "WSTYPT", "WORKSCHT"]
EMPColList = ["AGYSUB", "SEP", "DATECODE", "AGELVL", "GENDER", "GSEGRD", "LOSLVL", "LOC", "OCC", "PATCO", "PPGRD", "SALLVL", "TOA", "WORKSCH", "COUNT", "SALARY", "LOS", "AGYTYP", "AGYTYPT", "AGY", "AGYT", "AGYSUBT", "QTR", "AGELVLT", "LOSLVLT", "LOCTYP", "LOCTYPT", "LOCT", "OCCTYP", "OCCTYPT", "OCCFAM", "OCCFAMT", "OCCT", "PATCOT", "PPTYP", "PPTYPT", "PPGROUP", "PPGROUPT", "PAYPLAN", "PAYPLANT", "SALLVLT", "TOATYP", "TOATYPT", "TOAT", "WSTYP", "WSTYPT", "WORKSCHT"]

OPMDataMerged = pd.concat([OPMDataMerged[OPMColList], EMPDataOrig4Q[EMPColList]], ignore_index=True)
print("Total concatenated data size for SEP and non-SEP: "+str(len(OPMDataMerged)))

OPMDataMerged = OPMDataMerged.merge(AggSEPCount_EFDATE_OCC, left_on = ['DATECODE','OCC'], right_on = ['EFDATE','OCC'], how = 'left')
OPMDataMerged = OPMDataMerged.merge(AggSEPCount_EFDATE_LOC, left_on = ['DATECODE','LOC'], right_on = ['EFDATE','LOC'], how = 'left')
OPMDataMerged = OPMDataMerged.merge(AggIndAvgSalary, on = ['QTR','OCC', 'PPGRD', 'WORKSCHT'], how = 'left')
OPMDataMerged["SalaryOverUnderIndAvg"] = OPMDataMerged["SALARY"] - OPMDataMerged["IndAvgSalary"]

del OPMDataMerged["EFDATE_x"]
del OPMDataMerged["EFDATE_y"]

display(OPMDataMerged.head())
display(OPMDataMerged.tail())
Total concatenated data size for SEP and non-SEP: 8223193
AGYSUB SEP DATECODE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg
0 AA00 SC 201507 C M 11 A 11 0905 1 GS-11 F 40 F 1.0 63722.0 0.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 4 25-29 Less than 1 year 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 2 Non-permanent 40-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 205.0 1319 64540.593830 -818.593830
1 AA00 SC 201506 D F 15 C 11 0905 1 GS-15 L 30 F 1.0 126245.0 4.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 3 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $120,000 - $129,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 207.0 1132 149864.298504 -23619.298504
2 AF** SA 201503 H M 11 C 48 2210 2 GS-11 F 10 F 1.0 66585.0 4.9 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF**-INVALID 2 50-54 3 - 4 years 1 United States 48-TEXAS 1 White Collar 22 22xx-INFORMATION TECHNOLOGY 2210-INFORMATION TECHNOLOGY MANAGEMENT Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 439.0 1087 71530.963755 -4945.963755
3 AF02 SD 201506 I M 15 J 35 0301 2 GS-15 O 10 F 1.0 156737.0 39.8 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF02-AIR FORCE INSPECTION AGENCY (FO) 3 55-59 35 years or more 1 United States 35-NEW MEXICO 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $150,000 - $159,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 670.0 265 146735.220304 10001.779696
4 AF03 SC 201509 H M 13 B 06 0301 2 GS-13 I 15 F 1.0 92973.0 1.0 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF03-AIR FORCE OPERATIONAL TEST AND EVALUATION... 4 50-54 1 - 2 years 1 United States 06-CALIFORNIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $90,000 - $99,999 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 721.0 1853 101641.124025 -8668.124025
AGYSUB SEP DATECODE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg
8223188 ZU00 NS 201509 D NaN NaN C 11 0301 2 AD-00 G 48 F NaN 76377.0 4.8 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $70,000 - $79,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 -39463.182250
8223189 ZU00 NS 201509 K NaN NaN D 11 0301 2 AD-00 M 48 F NaN 139517.0 7.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 65 or more 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $130,000 - $139,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 23676.817750
8223190 ZU00 NS 201509 K NaN NaN D 11 0301 2 AD-00 O 48 F NaN 158671.0 7.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 65 or more 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $150,000 - $159,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 42830.817750
8223191 ZU00 NS 201509 B NaN NaN B 11 0301 2 AD-00 C 48 F NaN 36244.0 1.6 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 20-24 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $30,000 - $39,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 -79596.182250
8223192 ZU00 NS 201509 E NaN NaN D 11 0505 2 AD-00 I 48 F NaN 99288.0 5.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 35-39 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 05 05xx-ACCOUNTING AND BUDGET 0505-FINANCIAL MANAGEMENT Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $90,000 - $99,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 7.0 1391 148382.833333 -49094.833333
In [11]:
print(len(OPMDataMerged[OPMDataMerged["SEPCount_EFDATE_OCC"].isnull()]))

display(OPMDataMerged[OPMDataMerged["SEPCount_EFDATE_OCC"].isnull()][["SEP","DATECODE", "OCC"]].drop_duplicates())
50993
SEP DATECODE OCC
217479 NS 201412 7402
217582 NS 201412 7420
217603 NS 201412 1051
217663 NS 201412 1054
218685 NS 201412 2504
218871 NS 201412 8201
218999 NS 201412 4104
219003 NS 201412 4715
219135 NS 201412 0698
220085 NS 201412 0019
220426 NS 201412 3602
221497 NS 201412 2608
221637 NS 201412 3725
224242 NS 201412 6968
225410 NS 201412 0392
226132 NS 201412 3606
228440 NS 201412 2601
231003 NS 201412 3940
231189 NS 201412 5439
246316 NS 201412 1725
246379 NS 201412 5317
246874 NS 201412 5737
247551 NS 201412 1386
254687 NS 201412 0394
259606 NS 201412 4819
264047 NS 201412 2144
266228 NS 201412 1056
268830 NS 201412 5736
270371 NS 201412 0021
271810 NS 201412 3872
271986 NS 201412 4301
273244 NS 201412 3701
273326 NS 201412 6656
273665 NS 201412 8601
275118 NS 201412 3858
275185 NS 201412 4745
277206 NS 201412 4816
279195 NS 201412 1699
284633 NS 201412 5423
289472 NS 201412 1321
295232 NS 201412 3727
305466 NS 201412 1521
319867 NS 201412 0642
325062 NS 201412 4373
332634 NS 201412 2110
349140 NS 201412 0134
376747 NS 201412 0435
377161 NS 201412 1382
380400 NS 201412 0440
380444 NS 201412 0890
380485 NS 201412 1221
380534 NS 201412 0799
380549 NS 201412 0471
381610 NS 201412 5002
381660 NS 201412 0302
382823 NS 201412 4737
383757 NS 201412 1384
387315 NS 201412 3511
395468 NS 201412 1380
407172 NS 201412 0880
417564 NS 201412 1202
422349 NS 201412 0184
431266 NS 201412 5729
444970 NS 201412 3515
455591 NS 201412 4414
456269 NS 201412 1850
457198 NS 201412 0160
461672 NS 201412 0136
475474 NS 201412 1374
475784 NS 201412 6517
475903 NS 201412 6605
480003 NS 201412 5310
485323 NS 201412 3605
500481 NS 201412 4741
503007 NS 201412 1397
505340 NS 201412 3314
506488 NS 201412 5323
516998 NS 201412 4101
519478 NS 201412 0322
558933 NS 201412 4010
559676 NS 201412 0648
593579 NS 201412 3301
596650 NS 201412 3101
625941 NS 201412 7603
661444 NS 201412 4807
661550 NS 201412 3428
662101 NS 201412 5738
676144 NS 201412 5205
685955 NS 201412 6505
686300 NS 201412 3546
686445 NS 201412 5427
704369 NS 201412 2161
709277 NS 201412 9927
709522 NS 201412 9968
711524 NS 201412 9944
711608 NS 201412 9916
711636 NS 201412 9957
712047 NS 201412 9960
712126 NS 201412 9971
722081 NS 201412 1226
722670 NS 201412 1223
725221 NS 201412 1299
766570 NS 201412 1999
859013 NS 201412 1541
966674 NS 201412 0106
966676 NS 201412 0243
966716 NS 201412 0140
971520 NS 201412 0357
1087460 NS 201412 1046
1145227 NS 201412 4717
1345880 NS 201412 2501
1359410 NS 201412 3910
1363515 NS 201412 9961
1384266 NS 201412 9905
1389961 NS 201412 5419
1502113 NS 201412 3604
1503521 NS 201412 3808
1528597 NS 201412 1021
1534692 NS 201412 9942
1534710 NS 201412 9997
1534741 NS 201412 9975
1534742 NS 201412 9930
1534756 NS 201412 9945
1534808 NS 201412 9972
1534811 NS 201412 9995
1534822 NS 201412 9940
1534917 NS 201412 9993
1534931 NS 201412 9982
1535050 NS 201412 9999
1535060 NS 201412 9915
1535179 NS 201412 9955
1535640 NS 201412 9919
1535643 NS 201412 9918
1536227 NS 201412 9914
1537487 NS 201412 9921
1538507 NS 201412 9903
1562285 NS 201412 5221
1620441 NS 201412 1831
1846895 NS 201412 5440
1846946 NS 201412 3513
1848770 NS 201412 4406
1848778 NS 201412 4454
1872827 NS 201412 0593
1906935 NS 201412 0625
1937597 NS 201412 0637
2209093 NS 201503 1054
2209135 NS 201503 0050
2209165 NS 201503 7420
2209549 NS 201503 4805
2209562 NS 201503 7401
2209567 NS 201503 0062
2209830 NS 201503 1051
2210138 NS 201503 5767
2210218 NS 201503 2504
2210251 NS 201503 3940
2210298 NS 201503 4715
2210777 NS 201503 0017
2210790 NS 201503 0019
2210923 NS 201503 5026
2211013 NS 201503 3602
2211337 NS 201503 4255
2211397 NS 201503 3606
2211430 NS 201503 3809
2211601 NS 201503 1501
2211622 NS 201503 2608
2211840 NS 201503 3901
2211850 NS 201503 0698
2212525 NS 201503 0667
2213232 NS 201503 8610
2216698 NS 201503 4605
2216720 NS 201503 5439
2217256 NS 201503 1015
2221243 NS 201503 3725
2237081 NS 201503 5737
2237389 NS 201503 5317
2237807 NS 201503 1725
2237838 NS 201503 0131
2238836 NS 201503 1386
2239148 NS 201503 4417
2239413 NS 201503 4401
2245721 NS 201503 5876
2247270 NS 201503 4201
2250530 NS 201503 2135
2251137 NS 201503 0394
2254034 NS 201503 4819
2255121 NS 201503 7001
2255147 NS 201503 0021
2255494 NS 201503 2144
2257059 NS 201503 4602
2257560 NS 201503 4601
2258388 NS 201503 1056
2262895 NS 201503 3769
2262896 NS 201503 3707
2262911 NS 201503 4850
2263867 NS 201503 6656
2264417 NS 201503 4745
2264609 NS 201503 3872
2264927 NS 201503 4616
2265294 NS 201503 4301
2266492 NS 201503 8601
2267324 NS 201503 3727
2268045 NS 201503 7006
2269170 NS 201503 3712
2269386 NS 201503 2032
2273426 NS 201503 1521
2273438 NS 201503 0688
2274894 NS 201503 3858
2278199 NS 201503 4373
2282112 NS 201503 3401
2290158 NS 201503 4816
2305124 NS 201503 1321
2312141 NS 201503 5313
2319977 NS 201503 0642
2322021 NS 201503 1372
2326923 NS 201503 2110
2332047 NS 201503 1815
2367236 NS 201503 1146
2367963 NS 201503 1382
2368576 NS 201503 0435
2371118 NS 201503 0471
2371219 NS 201503 0487
2371445 NS 201503 1221
2373151 NS 201503 5002
2373803 NS 201503 0799
2373812 NS 201503 1384
2373939 NS 201503 0302
2375066 NS 201503 5001
2382890 NS 201503 0135
2383750 NS 201503 1380
2394962 NS 201503 5786
2398815 NS 201503 1202
2421724 NS 201503 5729
2422326 NS 201503 0309
2429714 NS 201503 3511
2435949 NS 201503 3515
2445822 NS 201503 4414
2445959 NS 201503 4402
2446076 NS 201503 1850
2446395 NS 201503 0160
2451711 NS 201503 0136
2465677 NS 201503 6517
2465911 NS 201503 1374
2469001 NS 201503 7601
2469608 NS 201503 5310
2473512 NS 201503 3605
2478500 NS 201503 4741
2481863 NS 201503 1630
2487747 NS 201503 5042
2492992 NS 201503 1397
2493609 NS 201503 5318
... ... ... ...
4493254 NS 201506 3605
4500627 NS 201506 5784
4501690 NS 201506 5323
4505453 NS 201506 3314
4511881 NS 201506 0322
4567936 NS 201506 0313
4590882 NS 201506 4754
4591246 NS 201506 3101
4618223 NS 201506 3301
4625645 NS 201506 7603
4647430 NS 201506 3106
4648437 NS 201506 4101
4656879 NS 201506 0873
4659315 NS 201506 5738
4660107 NS 201506 3802
4666481 NS 201506 4807
4673196 NS 201506 5205
4683095 NS 201506 6505
4683209 NS 201506 3546
4683500 NS 201506 5427
4706313 NS 201506 9924
4706406 NS 201506 9954
4706448 NS 201506 9923
4706828 NS 201506 9932
4706944 NS 201506 9920
4706992 NS 201506 9916
4707380 NS 201506 9971
4707627 NS 201506 9960
4711283 NS 201506 9944
4718825 NS 201506 1226
4719354 NS 201506 1223
4720834 NS 201506 1299
4838452 NS 201506 1163
4941776 NS 201506 0082
4967369 NS 201506 0140
4967373 NS 201506 0243
4967385 NS 201506 0106
4971515 NS 201506 0357
5079100 NS 201506 1046
5238592 NS 201506 1889
5364837 NS 201506 9905
5366360 NS 201506 3910
5379475 NS 201506 9961
5397362 NS 201506 4416
5397878 NS 201506 5419
5515876 NS 201506 3808
5525861 NS 201506 1021
5543637 NS 201506 9982
5543654 NS 201506 9942
5543662 NS 201506 9997
5543680 NS 201506 9955
5543695 NS 201506 9975
5543698 NS 201506 9906
5543700 NS 201506 9991
5543706 NS 201506 9908
5543725 NS 201506 9930
5543741 NS 201506 9976
5543817 NS 201506 9919
5543902 NS 201506 9999
5543929 NS 201506 9929
5544204 NS 201506 9940
5544280 NS 201506 9939
5544281 NS 201506 9914
5544294 NS 201506 9918
5544388 NS 201506 9915
5545181 NS 201506 9921
5560038 NS 201506 9904
5570248 NS 201506 5221
5612184 NS 201506 3428
5631455 NS 201506 1831
5708205 NS 201506 2125
5846894 NS 201506 5440
5847347 NS 201506 3513
5848754 NS 201506 4406
5848756 NS 201506 4441
5848778 NS 201506 4454
5848780 NS 201506 4449
5874147 NS 201506 0593
5899552 NS 201506 0625
5915763 NS 201506 0637
6215137 NS 201509 7420
6215177 NS 201509 1054
6215183 NS 201509 0062
6215311 NS 201509 1051
6216048 NS 201509 0319
6216143 NS 201509 0332
6216452 NS 201509 5026
6216656 NS 201509 3602
6216697 NS 201509 4255
6216714 NS 201509 3610
6216843 NS 201509 4715
6217038 NS 201509 8610
6217069 NS 201509 1501
6217109 NS 201509 4714
6217177 NS 201509 8201
6218219 NS 201509 0017
6221598 NS 201509 5439
6221611 NS 201509 6511
6222813 NS 201509 3901
6223238 NS 201509 6968
6224360 NS 201509 3725
6242835 NS 201509 5317
6242837 NS 201509 7305
6243763 NS 201509 3111
6244182 NS 201509 1725
6244772 NS 201509 3606
6245272 NS 201509 4401
6245800 NS 201509 1386
6246495 NS 201509 1815
6248929 NS 201509 1361
6251962 NS 201509 4201
6252800 NS 201509 5401
6253625 NS 201509 5737
6260660 NS 201509 4819
6261977 NS 201509 7001
6262910 NS 201509 2144
6265927 NS 201509 0021
6266002 NS 201509 4602
6269692 NS 201509 0967
6269891 NS 201509 4840
6270010 NS 201509 3707
6270144 NS 201509 5423
6270200 NS 201509 6656
6270425 NS 201509 3872
6270432 NS 201509 4745
6270882 NS 201509 4361
6271499 NS 201509 3701
6271656 NS 201509 3727
6272327 NS 201509 4816
6272630 NS 201509 4850
6273748 NS 201509 3858
6274513 NS 201509 1222
6276289 NS 201509 4616
6276382 NS 201509 4301
6277379 NS 201509 7006
6280721 NS 201509 8601
6308684 NS 201509 1321
6325366 NS 201509 1056
6327220 NS 201509 1521
6334577 NS 201509 2110
6346443 NS 201509 4417
6377693 NS 201509 0434
6378110 NS 201509 1999
6378827 NS 201509 0440
6378853 NS 201509 0487
6378954 NS 201509 0437
6379113 NS 201509 5002
6379245 NS 201509 1384
6380466 NS 201509 0410
6381133 NS 201509 0308
6381165 NS 201509 0302
6391135 NS 201509 0135
6392086 NS 201509 1380
6392140 NS 201509 0965
6405195 NS 201509 0880
6428400 NS 201509 1202
6433233 NS 201509 5729
6435611 NS 201509 0309
6438731 NS 201509 0184
6453143 NS 201509 3515
6464352 NS 201509 4414
6464503 NS 201509 1850
6470218 NS 201509 0136
6483824 NS 201509 1374
6485608 NS 201509 2501
6488173 NS 201509 1630
6490604 NS 201509 5310
6490614 NS 201509 4741
6509539 NS 201509 3605
6511085 NS 201509 0072
6511515 NS 201509 1397
6512110 NS 201509 5782
6512216 NS 201509 5323
6516672 NS 201509 3314
6584508 NS 201509 0635
6593181 NS 201509 0313
6606313 NS 201509 3101
6628632 NS 201509 3301
6639321 NS 201509 7603
6651971 NS 201509 1046
6658080 NS 201509 3106
6658384 NS 201509 4101
6666886 NS 201509 0873
6667185 NS 201509 4373
6669581 NS 201509 4807
6670232 NS 201509 5738
6672326 NS 201509 3802
6693020 NS 201509 5427
6711618 NS 201509 2161
6713304 NS 201509 0958
6716480 NS 201509 9973
6716659 NS 201509 9916
6716801 NS 201509 9932
6717186 NS 201509 9971
6718325 NS 201509 9965
6719926 NS 201509 9960
6729094 NS 201509 1226
6729105 NS 201509 1223
6731464 NS 201509 1299
6741881 NS 201509 5313
6803145 NS 201509 1730
6820061 NS 201509 6941
6847199 NS 201509 1163
6868697 NS 201509 1541
6976225 NS 201509 0140
6976233 NS 201509 0106
6976245 NS 201509 0243
6980096 NS 201509 0357
7155776 NS 201509 4717
7247692 NS 201509 1881
7371930 NS 201509 3910
7376502 NS 201509 9961
7392449 NS 201509 0485
7404982 NS 201509 4403
7405044 NS 201509 4416
7405049 NS 201509 5419
7521217 NS 201509 3808
7524148 NS 201509 3604
7536239 NS 201509 1021
7552743 NS 201509 9991
7552774 NS 201509 9975
7552775 NS 201509 9998
7552813 NS 201509 9976
7552825 NS 201509 9994
7552870 NS 201509 9988
7552961 NS 201509 9940
7552962 NS 201509 9982
7553012 NS 201509 9915
7553021 NS 201509 9930
7553042 NS 201509 9929
7553061 NS 201509 9999
7553126 NS 201509 9993
7553127 NS 201509 9908
7553159 NS 201509 9955
7553740 NS 201509 9914
7553846 NS 201509 9939
7554073 NS 201509 9921
7554679 NS 201509 9919
7554855 NS 201509 9918
7569222 NS 201509 9904
7579652 NS 201509 5221
7612169 NS 201509 3428
7717999 NS 201509 2125
7851216 NS 201509 5440
7851838 NS 201509 3513
7853201 NS 201509 4406
7853227 NS 201509 4454
7884942 NS 201509 0593
7965633 NS 201509 0625
8001140 NS 201509 0637

660 rows × 3 columns

These 50993 Non-Separation observations do not have coverage within the Separation Dataset, thus, we will remove these observations as out of scope demographic in our analysis. Any attempt in predicting these values will not have enough data to support a significant response.

In [12]:
OPMDataMerged = OPMDataMerged[OPMDataMerged["SEPCount_EFDATE_OCC"].notnull()]

print(len(OPMDataMerged[OPMDataMerged["SEPCount_EFDATE_OCC"].isnull()]))

print(len(OPMDataMerged))
0
8172200
In [13]:
print(len(OPMDataMerged[OPMDataMerged["SEPCount_EFDATE_LOC"].isnull()]))

display(OPMDataMerged[OPMDataMerged["SEPCount_EFDATE_LOC"].isnull()][["SEP","DATECODE","LOC"]].drop_duplicates())
0
SEP DATECODE LOC
In [14]:
print(len(OPMDataMerged[OPMDataMerged["IndAvgSalary"].isnull()]))

display(OPMDataMerged[OPMDataMerged["IndAvgSalary"].isnull()][["QTR", "SEP","OCCT", "PPGRD", "WORKSCHT"]].drop_duplicates())
1293
QTR SEP OCCT PPGRD WORKSCHT
257 4 SC 7401-MISC FOOD PREPARATION AND SERVING WG-01 Full-time Nonseasonal
627 4 SC 1301-GENERAL PHYSICAL SCIENCE AD-24 Part-time Nonseasonal
697 4 SJ 0199-SOCIAL SCIENCE STUDENT TRAINEE GS-02 Intermittent Nonseasonal
749 4 SC 3940-BROADCASTING EQUIPMENT OPERATING WG-10 Full-time Nonseasonal
2401 4 SJ 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GS-02 Intermittent Seasonal
3412 2 SC 5003-GARDENING WG-04 Full-time Seasonal
3471 1 SA 5003-GARDENING WG-04 Full-time Seasonal
3551 3 SD 5716-ENGINEERING EQUIPMENT OPERATING WS-14 Full-time Nonseasonal
4937 3 SC 0819-ENVIRONMENTAL ENGINEERING GS-11 Part-time Job Sharer Nonseasonal
5285 1 SD 5716-ENGINEERING EQUIPMENT OPERATING WG-08 Intermittent Seasonal
5363 4 SJ 0189-RECREATION AID AND ASSISTANT GS-03 Intermittent Nonseasonal
5763 1 SD 2005-SUPPLY CLERICAL AND TECHNICIAN GS-04 Part-time Job Sharer Nonseasonal
6079 3 SC 0180-PSYCHOLOGY NH-02 Full-time Nonseasonal
6957 3 SD 0810-CIVIL ENGINEERING DR-03 Full-time Nonseasonal
7015 1 SA 1306-HEALTH PHYSICS DR-01 Full-time Nonseasonal
7376 4 SC 1699-EQUIPMENT AND FACILITIES MANAGEMENT STUDE... GS-05 Full-time Nonseasonal
7395 3 SC 0599-FINANCIAL MANAGEMENT STUDENT TRAINEE DU-01 Full-time Nonseasonal
7464 3 SD 3769-SHOT PEENING MACHINE OPERATING WS-07 Full-time Nonseasonal
7512 4 SC 0840-NUCLEAR ENGINEERING DR-03 Full-time Nonseasonal
7675 2 SD 4714-MODEL MAKING WL-15 Full-time Nonseasonal
7727 4 SC 0189-RECREATION AID AND ASSISTANT GS-02 Part-time Seasonal
7877 4 SJ 0189-RECREATION AID AND ASSISTANT GS-02 Part-time Seasonal
8054 4 SC 0665-SPEECH PATHOLOGY AND AUDIOLOGY DR-03 Full-time Nonseasonal
8160 4 SC 4102-PAINTING WG-05 Part-time Seasonal
8216 2 SA 5725-CRANE OPERATING WG-08 Full-time Nonseasonal
8320 4 SD 5401-MISC INDUSTRIAL EQUIPMENT OPERATION WS-11 Full-time Nonseasonal
8325 1 SC 0189-RECREATION AID AND ASSISTANT DU-01 Part-time Nonseasonal
8389 4 SD 3705-NON-DESTRUCTIVE TESTING WS-16 Full-time Nonseasonal
8435 4 SD 1330-ASTRONOMY AND SPACE SCIENCE DR-04 Full-time Nonseasonal
8449 1 SK 4102-PAINTING WG-05 Part-time Seasonal
9741 4 SC 0189-RECREATION AID AND ASSISTANT GS-03 Part-time Seasonal
9890 1 SI 8801-MISCELLANEOUS AIRCRAFT OVERHAUL WG-08 Part-time Nonseasonal
9903 1 SD 0610-NURSE DR-01 Full-time Nonseasonal
9916 2 SD 0130-FOREIGN AFFAIRS DO-02 Full-time Nonseasonal
10139 4 SC 6901-MISC WAREHOUSING AND STOCK HANDLING WG-06 Part-time Job Sharer Nonseasonal
10656 2 SC 5309-HEATING & BOILER PLANT EQUIPMT MECHANIC WL-11 Full-time Nonseasonal
11491 4 SC 1008-INTERIOR DESIGN GG-12 Full-time Nonseasonal
11516 4 SI 0854-COMPUTER ENGINEERING GG-11 Full-time Nonseasonal
11941 1 SJ 0201-HUMAN RESOURCES MANAGEMENT GS-06 Full-time Nonseasonal
12265 4 SJ 0335-COMPUTER CLERK AND ASSISTANT GS-09 Part-time Nonseasonal
12541 2 SJ 6501-MISC AMMUN, EXPLOSIVES, & TOXIC MATER WORK WG-12 Full-time Nonseasonal
13028 3 SJ 5378-POWERED SUPPORT SYSTEMS MECHANIC WG-10 Part-time Nonseasonal
13029 2 SJ 2610-ELECTRONIC INTEGRATED SYSTEMS MECHANIC WG-12 Part-time Nonseasonal
13643 3 SJ 8602-AIRCRAFT ENGINE MECHANIC WG-04 Full-time Nonseasonal
14084 3 SJ 0340-PROGRAM MANAGEMENT GS-14 Part-time Nonseasonal
15523 2 SJ 2892-AIRCRAFT ELECTRICIAN WG-06 Full-time Nonseasonal
16075 3 SJ 2892-AIRCRAFT ELECTRICIAN WG-07 Full-time Nonseasonal
16454 2 SC 5378-POWERED SUPPORT SYSTEMS MECHANIC WG-06 Full-time Nonseasonal
16512 1 SJ 8602-AIRCRAFT ENGINE MECHANIC WG-06 Full-time Nonseasonal
16691 2 SJ 0132-INTELLIGENCE GS-04 Full-time Nonseasonal
17344 1 SJ 2101-TRANSPORTATION SPECIALIST GS-07 Intermittent Nonseasonal
17376 4 SJ 0335-COMPUTER CLERK AND ASSISTANT GS-07 Part-time Nonseasonal
17426 3 SC 0335-COMPUTER CLERK AND ASSISTANT GS-06 Intermittent Nonseasonal
17464 3 SJ 8852-AIRCRAFT MECHANIC WG-04 Full-time Nonseasonal
17763 1 SJ 4818-AIRCRAFT SURVIVAL FLIGHT EQUIPMENT REPAIR WG-10 Part-time Nonseasonal
19309 4 SD 0701-VETERINARY MEDICAL SCIENCE GM-15 Full-time Nonseasonal
19312 3 SD 0410-ZOOLOGY ST-00 Full-time Nonseasonal
19704 2 SJ 3511-LABORATORY WORKING WG-01 Part-time Nonseasonal
19768 2 SD 0435-PLANT PHYSIOLOGY GM-15 Full-time Nonseasonal
20138 3 SC 0802-ENGINEERING TECHNICAL GS-03 Part-time Nonseasonal
20285 4 SC 3566-CUSTODIAL WORKING WG-01 Intermittent Seasonal
20720 2 SC 0135-FOREIGN AGRICULTURAL AFFAIRS FP-03 Full-time Nonseasonal
20754 4 SJ 0119-ECONOMICS ASSISTANT GS-03 Full-time Seasonal
20760 4 SJ 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GS-04 Full-time Seasonal
20777 3 SD 0135-FOREIGN AGRICULTURAL AFFAIRS FE-01 Full-time Nonseasonal
20878 1 SJ 0189-RECREATION AID AND ASSISTANT GS-01 Full-time Nonseasonal
20928 1 SJ 0462-FORESTRY TECHNICIAN GS-04 Intermittent Seasonal
21599 3 SJ 0455-RANGE TECHNICIAN GS-05 Part-time Nonseasonal
23681 1 SJ 0102-SOCIAL SCIENCE AID AND TECHNICIAN GS-03 Full-time Nonseasonal
24266 1 SI 8610-SMALL ENGINE MECHANIC WG-06 Full-time Seasonal
24310 3 SJ 0455-RANGE TECHNICIAN GS-06 Intermittent Nonseasonal
26446 1 SC 0304-INFORMATION RECEPTIONIST GS-04 Intermittent Nonseasonal
27067 1 SJ 0455-RANGE TECHNICIAN GS-05 Intermittent Seasonal
29585 2 SD 1071-AUDIOVISUAL PRODUCTION GM-13 Full-time Nonseasonal
29689 2 SJ 0802-ENGINEERING TECHNICAL GS-06 Part-time Seasonal
29724 2 SJ 0462-FORESTRY TECHNICIAN GS-01 Full-time Nonseasonal
29878 1 SJ 1001-GENERAL ARTS AND INFORMATION GS-05 Intermittent Seasonal
30106 1 SJ 5201-MISCELLANEOUS OCCUPATIONS WG-05 Intermittent Nonseasonal
30137 3 SA 0430-BOTANY GS-07 Full-time Nonseasonal
30156 2 SJ 0102-SOCIAL SCIENCE AID AND TECHNICIAN GS-04 Intermittent Nonseasonal
30803 4 SC 0189-RECREATION AID AND ASSISTANT GS-04 Intermittent Nonseasonal
30817 3 SJ 1341-METEOROLOGICAL TECHNICIAN GS-08 Part-time Nonseasonal
32058 1 SJ 1371-CARTOGRAPHIC TECHNICIAN GS-07 Intermittent Nonseasonal
32562 1 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GS-02 Full-time Seasonal
33011 1 SC 0335-COMPUTER CLERK AND ASSISTANT GS-07 Part-time Nonseasonal
33713 3 SD 4715-EXHIBITS MAKING/MODELING WL-07 Full-time Nonseasonal
33737 4 SD 0850-ELECTRICAL ENGINEERING GM-14 Full-time Nonseasonal
33870 4 SJ 0318-SECRETARY GS-03 Intermittent Nonseasonal
34226 1 SC 0322-CLERK-TYPIST GS-04 Intermittent Nonseasonal
35280 3 SJ 0486-WILDLIFE BIOLOGY AD-00 Intermittent Nonseasonal
35308 3 SJ 1421-ARCHIVES TECHNICIAN GS-07 Full-time Seasonal
35369 2 SJ 0326-OFFICE AUTOMATION CLERICAL AND ASSISTANCE GS-04 Intermittent Seasonal
35683 3 SC 0421-PLANT PROTECTION TECHNICIAN GS-05 Intermittent Nonseasonal
35733 4 SJ 1421-ARCHIVES TECHNICIAN GS-07 Intermittent Nonseasonal
35779 4 SJ 0404-BIOLOGICAL SCIENCE TECHNICIAN AD-00 Part-time Seasonal
36150 3 SC 1863-FOOD INSPECTION GS-08 Intermittent Nonseasonal
36341 4 SC 1899-INVESTIGATION STUDENT TRAINEE GS-03 Full-time Nonseasonal
36424 2 SD 0896-INDUSTRIAL ENGINEERING GM-13 Full-time Nonseasonal
36788 2 SD 0935-ADMINISTRATIVE LAW JUDGE AL-02 Full-time Nonseasonal
37280 1 SG 0905-GENERAL ATTORNEY FE-02 Full-time Nonseasonal
37464 3 SJ 1140-TRADE SPECIALIST GS-15 Intermittent Nonseasonal
37478 4 SG 0130-FOREIGN AFFAIRS FE-03 Full-time Nonseasonal
37721 1 SC 0809-CONSTRUCTION CONTROL TECHNICAL NJ-03 Full-time Nonseasonal
37845 4 SC 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM NH-02 Part-time Nonseasonal
38023 1 SC 0318-SECRETARY NK-02 Part-time Nonseasonal
38509 1 SC 0184-SOCIOLOGY GG-12 Full-time Nonseasonal
38675 2 SD 0896-INDUSTRIAL ENGINEERING GG-13 Full-time Nonseasonal
39900 3 SJ 5803-HEAVY MOBILE EQUIPMENT MECHANIC WG-08 Intermittent Nonseasonal
40629 4 SC 6904-TOOLS AND PARTS ATTENDING WG-02 Full-time Nonseasonal
41443 4 SD 5407-ELECTRICAL POWER CONTROLLING WG-08 Full-time Nonseasonal
41565 2 SJ 0085-SECURITY GUARD GS-06 Part-time Nonseasonal
41658 1 SJ 5705-TRACTOR OPERATING WL-04 Intermittent Nonseasonal
41759 1 SD 6610-SMALL ARMS REPAIRING WL-09 Full-time Nonseasonal
41774 1 SI 0072-FINGERPRINT IDENTIFICATION GS-09 Full-time Nonseasonal
41961 1 SC 0085-SECURITY GUARD GS-03 Intermittent Nonseasonal
42017 3 SJ 5716-ENGINEERING EQUIPMENT OPERATING WL-08 Intermittent Nonseasonal
42082 1 SJ 5784-RIVERBOAT OPERATING XH-14 Full-time Seasonal
42267 1 SC 0802-ENGINEERING TECHNICAL GS-10 Intermittent Nonseasonal
42308 3 SJ 6907-MATERIALS HANDLER WG-06 Intermittent Nonseasonal
42324 1 SJ 5786-SMALL CRAFT OPERATING WG-08 Intermittent Nonseasonal
42328 1 SC 0856-ELECTRONICS TECHNICAL GS-09 Intermittent Nonseasonal
42414 2 SC 1699-EQUIPMENT AND FACILITIES MANAGEMENT STUDE... GS-03 Full-time Nonseasonal
42653 4 SC 2805-ELECTRICIAN WY-10 Intermittent Nonseasonal
42718 4 SJ 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE DE-02 Part-time Nonseasonal
42750 4 SJ 5701-MISC TRANSPORTATION/MOBILE EQUIPMENT OPER XF-01 Full-time Nonseasonal
42799 4 SJ 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE DE-01 Intermittent Nonseasonal
42910 3 SC 0499-BIOLOGICAL SCIENCE STUDENT TRAINEE DB-01 Full-time Nonseasonal
43219 3 SJ 5407-ELECTRICAL POWER CONTROLLING WB-00 Part-time Nonseasonal
43234 3 SJ 5701-MISC TRANSPORTATION/MOBILE EQUIPMENT OPER WG-03 Part-time Nonseasonal
43355 3 SJ 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM DJ-05 Intermittent Nonseasonal
43595 1 SJ 5426-LOCK AND DAM OPERATING WY-03 Full-time Nonseasonal
43739 2 SF 7404-COOKING XH-06 Full-time Seasonal
44192 3 SJ 1599-MATHEMATICS AND STATISTICS STUDENT TRAINEE DB-01 Part-time Nonseasonal
44297 4 SD 1530-STATISTICS DB-04 Part-time Nonseasonal
44406 2 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GS-05 Intermittent Nonseasonal
44419 2 SA 0890-AGRICULTURAL ENGINEERING DB-02 Full-time Nonseasonal
44469 2 SK 7401-MISC FOOD PREPARATION AND SERVING XH-07 Full-time Seasonal
44784 2 SI 5782-SHIP OPERATING XH-13 Full-time Seasonal
44788 1 SJ 7404-COOKING XF-05 Full-time Nonseasonal
44901 2 SC 1640-FACILITY OPERATIONS SERVICES GS-09 Intermittent Nonseasonal
44952 2 SC 5782-SHIP OPERATING XH-13 Full-time Seasonal
45157 3 SD 0808-ARCHITECTURE DB-05 Full-time Nonseasonal
45354 2 SD 5725-CRANE OPERATING XH-12 Full-time Nonseasonal
45458 1 SJ 3703-WELDING WG-10 Intermittent Nonseasonal
45537 1 SK 0544-CIVILIAN PAY GS-06 Full-time Seasonal
50004 3 SC 0682-DENTAL HYGIENE GS-07 Intermittent Nonseasonal
50228 1 SJ 0560-BUDGET ANALYSIS DJ-03 Part-time Nonseasonal
50820 3 SC 0401-GENERAL NATURAL RESOURCES MANAGEMENT AND ... DB-02 Part-time Nonseasonal
51222 3 SC 0186-SOCIAL SERVICES AID AND ASSISTANT GS-08 Part-time Nonseasonal
51430 1 SC 1712-TRAINING INSTRUCTION DJ-03 Full-time Nonseasonal
51929 1 SC 0089-EMERGENCY MANAGEMENT SPECIALIST DJ-04 Full-time Nonseasonal
52912 2 SC 0085-SECURITY GUARD GS-12 Full-time Nonseasonal
53771 3 SC 6610-SMALL ARMS REPAIRING WG-07 Full-time Nonseasonal
54340 1 SJ 1035-PUBLIC AFFAIRS GS-06 Full-time Nonseasonal
55115 3 SJ 5801-MISC TRANSPORTATION/MOBILE EQUIPMT MAINTNE WG-08 Intermittent Nonseasonal
55382 4 SC 8810-AIRCRAFT PROPELLER MECHANIC WG-08 Full-time Nonseasonal
55575 4 SC 0203-HUMAN RESOURCES ASSISTANCE GS-05 Intermittent Nonseasonal
55689 4 SJ 2604-ELECTRONICS MECHANIC WG-12 Part-time Nonseasonal
57336 3 SJ 5413-FUEL DISTRIBUTION SYSTEM OPERATING WG-05 Intermittent Nonseasonal
57671 1 SJ 5801-MISC TRANSPORTATION/MOBILE EQUIPMT MAINTNE WG-08 Intermittent Nonseasonal
59155 3 SD 3101-MISC FABRIC AND LEATHER WORK WS-11 Full-time Nonseasonal
59259 4 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... AD-00 Full-time Nonseasonal
60044 3 SA 6610-SMALL ARMS REPAIRING WL-06 Full-time Nonseasonal
60716 4 SJ 0101-SOCIAL SCIENCE AD-00 Part-time Nonseasonal
61107 2 SD 5201-MISCELLANEOUS OCCUPATIONS WS-15 Full-time Nonseasonal
61239 2 SI 2601-MISC ELECTRONIC EQUIPMT INSTALL & MAINTNE WG-04 Full-time Nonseasonal
62578 4 SA 2608-ELECTRONIC DIGITAL COMPUTER MECHANIC WL-10 Full-time Nonseasonal
63872 3 SF 2005-SUPPLY CLERICAL AND TECHNICIAN GS-04 Full-time Seasonal
63902 3 SF 2610-ELECTRONIC INTEGRATED SYSTEMS MECHANIC WT-00 Full-time Nonseasonal
64021 2 SD 3105-FABRIC WORKING WL-11 Full-time Nonseasonal
64348 3 SC 3101-MISC FABRIC AND LEATHER WORK WG-01 Part-time Nonseasonal
64386 3 SJ 2299-INFORMATION TECHNOLOGY STUDENT TRAINEE DE-01 Part-time Nonseasonal
64539 2 SA 1550-COMPUTER SCIENCE DB-03 Part-time Nonseasonal
64565 2 SC 0830-MECHANICAL ENGINEERING DB-02 Part-time Nonseasonal
64776 2 SJ 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM DJ-04 Part-time Nonseasonal
65181 4 SC 1510-ACTUARIAL SCIENCE GS-15 Intermittent Nonseasonal
65336 1 SC 0110-ECONOMIST AD-00 Intermittent Nonseasonal
65528 1 SA 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... ZA-02 Full-time Nonseasonal
65673 2 SD 1361-NAVIGATIONAL INFORMATION ZA-03 Full-time Nonseasonal
65698 3 SC 2299-INFORMATION TECHNOLOGY STUDENT TRAINEE ZP-01 Full-time Nonseasonal
65736 3 SD 0817-SURVEY TECHNICAL ZT-02 Full-time Nonseasonal
65783 3 SD 5786-SMALL CRAFT OPERATING WG-08 Part-time Nonseasonal
65909 2 SK 1530-STATISTICS ZP-04 Intermittent Nonseasonal
65913 4 SJ 0410-ZOOLOGY ZP-02 Full-time Nonseasonal
65949 1 SD 1382-FOOD TECHNOLOGY ZP-05 Full-time Nonseasonal
66022 2 SC 9932-FIRST ASSISTANT ENGINEER WM-11 Full-time Nonseasonal
66070 2 SA 0505-FINANCIAL MANAGEMENT ZA-05 Full-time Nonseasonal
66108 2 SA 0361-EQUAL OPPORTUNITY ASSISTANCE ZS-04 Full-time Nonseasonal
66142 3 SK 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM ZA-05 Part-time Nonseasonal
66160 1 SC 0401-GENERAL NATURAL RESOURCES MANAGEMENT AND ... ZP-02 Full-time Seasonal
66266 4 SD 1016-MUSEUM SPECIALIST AND TECHNICIAN ZA-03 Full-time Nonseasonal
66322 3 SJ 1140-TRADE SPECIALIST ED-00 Intermittent Nonseasonal
66328 2 SD 1801-GENERAL INSPECTION, INVESTIGATION, ENFORC... GM-15 Full-time Nonseasonal
66555 2 SC 1224-PATENT EXAMINING GS-09 Full-time Seasonal
66966 3 SC 1222-PATENT ATTORNEY AD-00 Intermittent Nonseasonal
67005 1 SC 1299-COPYRIGHT AND PATENT STUDENT TRAINEE GS-04 Full-time Nonseasonal
67046 2 SC 1320-CHEMISTRY ZP-04 Full-time Seasonal
67055 2 SC 0341-ADMINISTRATIVE OFFICER ZA-02 Part-time Nonseasonal
67061 3 SA 0203-HUMAN RESOURCES ASSISTANCE ZS-04 Part-time Nonseasonal
67064 3 SC 0201-HUMAN RESOURCES MANAGEMENT ZA-03 Intermittent Nonseasonal
67109 4 SJ 2210-INFORMATION TECHNOLOGY MANAGEMENT ZP-05 Intermittent Nonseasonal
67139 4 SC 1310-PHYSICS ZP-03 Part-time Nonseasonal
67145 4 SC 0342-SUPPORT SERVICES ADMINISTRATION ZS-02 Full-time Nonseasonal
67162 4 SK 0809-CONSTRUCTION CONTROL TECHNICAL ZT-02 Full-time Nonseasonal
67169 3 SC 0804-FIRE PROTECTION ENGINEERING ZP-03 Full-time Nonseasonal
67178 4 SC 0342-SUPPORT SERVICES ADMINISTRATION ZS-03 Full-time Nonseasonal
67561 1 SD 1531-STATISTICAL ASSISTANT GG-05 Full-time Seasonal
67840 1 SD 1530-STATISTICS GG-15 Full-time Nonseasonal
69645 3 SA 0201-HUMAN RESOURCES MANAGEMENT GS-07 Full-time Seasonal
69678 3 SC 1371-CARTOGRAPHIC TECHNICIAN GS-04 Full-time Seasonal
70986 2 SA 1529-MATHEMATICAL STATISTICS GS-11 Full-time Seasonal
72229 4 SC 1099-INFORMATION AND ARTS STUDENT TRAINEE CT-04 Full-time Nonseasonal
72291 2 SA 0132-INTELLIGENCE CU-15 Full-time Nonseasonal
72302 2 SD 0905-GENERAL ATTORNEY CU-14 Part-time Nonseasonal
72317 3 SD 0580-CREDIT UNION EXAMINER CU-14 Part-time Nonseasonal
72323 3 SA 0260-EQUAL EMPLOYMENT OPPORTUNITY CU-15 Full-time Nonseasonal
72335 1 SD 0510-ACCOUNTING CU-15 Full-time Nonseasonal
72337 2 SD 1102-CONTRACTING CU-13 Full-time Nonseasonal
72428 2 SC 1102-CONTRACTING NH-04 Part-time Job Sharer Nonseasonal
73089 2 SC 1082-WRITING AND EDITING AD-01 Full-time Nonseasonal
73246 3 SJ 0203-HUMAN RESOURCES ASSISTANCE GS-07 Intermittent Nonseasonal
73318 2 SC 1999-QUALITY INSPECTION STUDENT TRAINEE GS-03 Full-time Nonseasonal
74032 4 SJ 1107-PROPERTY DISPOSAL CLERICAL AND TECHNICIAN GS-04 Full-time Nonseasonal
74457 3 SC 2032-PACKAGING GS-14 Part-time Nonseasonal
75550 1 SD 0080-SECURITY ADMINISTRATION IE-00 Full-time Nonseasonal
75574 4 SD 0306-GOVERNMENT INFORMATION SPECIALIST GG-14 Full-time Nonseasonal
75624 4 SC 0806-MATERIALS ENGINEERING AD-00 Full-time Nonseasonal
75634 1 SJ 0830-MECHANICAL ENGINEERING EE-00 Full-time Nonseasonal
75813 1 SC 0631-OCCUPATIONAL THERAPIST AD-16 Full-time Seasonal
75845 4 SH 1710-EDUCATION AND VOCATIONAL TRAINING AD-13 Part-time Seasonal
76307 4 SC 0665-SPEECH PATHOLOGY AND AUDIOLOGY AD-14 Full-time Seasonal
76420 4 SH 1710-EDUCATION AND VOCATIONAL TRAINING AD-14 Part-time Nonseasonal
76667 3 SC 0640-HEALTH AID AND TECHNICIAN GS-04 Part-time Seasonal
76735 4 SJ 0610-NURSE AD-11 Intermittent Nonseasonal
76771 3 SC 1710-EDUCATION AND VOCATIONAL TRAINING AD-00 Part-time Seasonal
77484 2 SC 0808-ARCHITECTURE NH-02 Full-time Nonseasonal
77695 3 SJ 7408-FOOD SERVICE WORKING WL-02 Part-time Nonseasonal
77887 1 SF 1101-GENERAL BUSINESS AND INDUSTRY GS-02 Full-time Nonseasonal
77916 1 SJ 1101-GENERAL BUSINESS AND INDUSTRY GS-01 Full-time Nonseasonal
80303 4 SC 0599-FINANCIAL MANAGEMENT STUDENT TRAINEE GS-05 Full-time Seasonal
83224 4 SD 1101-GENERAL BUSINESS AND INDUSTRY AD-02 Full-time Nonseasonal
83250 2 SD 0346-LOGISTICS MANAGEMENT AD-03 Full-time Nonseasonal
83404 1 SA 1799-EDUCATION STUDENT TRAINEE NJ-02 Full-time Nonseasonal
83422 2 SC 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM NH-04 Intermittent Nonseasonal
83425 2 SD 0896-INDUSTRIAL ENGINEERING AD-12 Full-time Nonseasonal
83442 1 SJ 1701-GENERAL EDUCATION AND TRAINING AD-22 Intermittent Nonseasonal
84642 4 SC 1060-PHOTOGRAPHY GS-12 Part-time Nonseasonal
85721 3 SD 0610-NURSE GL-10 Part-time Nonseasonal
85965 2 SC 1199-BUSINESS AND INDUSTRY STUDENT TRAINEE GL-04 Part-time Nonseasonal
86905 4 SJ 0299-HUMAN RESOURCES MANAGEMENT STUDENT TRAINEE GL-05 Full-time Nonseasonal
... ... ... ... ... ...
138901 2 SC 1550-COMPUTER SCIENCE EG-00 Intermittent Nonseasonal
138906 3 SJ 0018-SAFETY AND OCCUPATIONAL HEALTH MANAGEMENT GS-15 Intermittent Nonseasonal
138918 3 SJ 1520-MATHEMATICS EE-00 Intermittent Nonseasonal
138947 2 SC 1520-MATHEMATICS EG-00 Intermittent Nonseasonal
138988 1 SD 0170-HISTORY AD-04 Full-time Nonseasonal
139029 2 SC 1040-LANGUAGE SPECIALIST GS-07 Part-time Nonseasonal
139618 1 SJ 1550-COMPUTER SCIENCE EF-00 Intermittent Nonseasonal
140156 4 SJ 0020-COMMUNITY PLANNING EE-00 Full-time Nonseasonal
140167 1 SC 1010-EXHIBITS SPECIALIST GS-11 Part-time Nonseasonal
140258 4 SJ 1499-LIBRARY AND ARCHIVES STUDENT TRAINEE GS-02 Part-time Nonseasonal
140910 1 SK 1301-GENERAL PHYSICAL SCIENCE AJ-00 Full-time Nonseasonal
140921 3 SD 0080-SECURITY ADMINISTRATION EG-00 Intermittent Nonseasonal
140948 4 SC 0999-LEGAL OCCUPATIONS STUDENT TRAINEE GG-07 Full-time Nonseasonal
140949 4 SJ 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GG-05 Full-time Nonseasonal
140981 4 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GG-05 Full-time Nonseasonal
140984 4 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GG-09 Full-time Nonseasonal
140987 4 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-03 Full-time Nonseasonal
140992 4 SC 0999-LEGAL OCCUPATIONS STUDENT TRAINEE GG-09 Full-time Nonseasonal
141000 4 SJ 0343-MANAGEMENT AND PROGRAM ANALYSIS GG-07 Part-time Nonseasonal
141005 4 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-05 Full-time Nonseasonal
141010 4 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-07 Full-time Nonseasonal
141027 4 SC 2299-INFORMATION TECHNOLOGY STUDENT TRAINEE GG-07 Full-time Nonseasonal
141029 4 SC 1399-PHYSICAL SCIENCE STUDENT TRAINEE GG-07 Full-time Nonseasonal
141056 2 SD 0482-FISH BIOLOGY GG-15 Full-time Nonseasonal
141057 4 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-04 Full-time Nonseasonal
141071 4 SJ 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-05 Full-time Nonseasonal
141081 4 SJ 0599-FINANCIAL MANAGEMENT STUDENT TRAINEE GG-05 Part-time Nonseasonal
141100 4 SJ 0599-FINANCIAL MANAGEMENT STUDENT TRAINEE GG-05 Full-time Nonseasonal
141128 3 SJ 0801-GENERAL ENGINEERING GG-15 Intermittent Nonseasonal
141145 2 SJ 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-05 Full-time Nonseasonal
141150 1 SJ 2299-INFORMATION TECHNOLOGY STUDENT TRAINEE GG-09 Full-time Nonseasonal
141153 3 SD 1301-GENERAL PHYSICAL SCIENCE GG-15 Part-time Nonseasonal
141852 3 SD 1811-CRIMINAL INVESTIGATION IE-00 Full-time Nonseasonal
141986 2 SD 0332-COMPUTER OPERATION NC-03 Full-time Nonseasonal
141991 3 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE NR-01 Full-time Nonseasonal
142033 2 SA 2091-SALES STORE CLERICAL NC-01 Full-time Nonseasonal
142047 1 SD 0690-INDUSTRIAL HYGIENE NO-04 Full-time Nonseasonal
142061 3 SC 0999-LEGAL OCCUPATIONS STUDENT TRAINEE NC-01 Full-time Nonseasonal
142069 4 SC 1399-PHYSICAL SCIENCE STUDENT TRAINEE NP-01 Full-time Nonseasonal
142072 4 SJ 0855-ELECTRONICS ENGINEERING NP-03 Intermittent Nonseasonal
142075 3 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE NR-01 Part-time Nonseasonal
142111 2 SA 0809-CONSTRUCTION CONTROL TECHNICAL NR-03 Full-time Nonseasonal
142119 4 SC 1310-PHYSICS NP-04 Part-time Nonseasonal
142154 3 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE NP-02 Part-time Nonseasonal
142163 1 SJ 1411-LIBRARY TECHNICIAN NC-01 Full-time Nonseasonal
142168 2 SC 0086-SECURITY CLERICAL AND ASSISTANCE NC-01 Full-time Nonseasonal
142203 1 SJ 1399-PHYSICAL SCIENCE STUDENT TRAINEE NR-02 Part-time Nonseasonal
143128 2 SC 0638-RECREATION/CREATIVE ARTS THERAPIST GS-11 Part-time Nonseasonal
143264 1 SD 6901-MISC WAREHOUSING AND STOCK HANDLING WG-03 Part-time Nonseasonal
143491 4 SC 0895-INDUSTRIAL ENGINEERING TECHNICAL GS-04 Full-time Nonseasonal
143699 3 SJ 2210-INFORMATION TECHNOLOGY MANAGEMENT DS-01 Full-time Nonseasonal
143939 3 SK 1311-PHYSICAL SCIENCE TECHNICIAN DT-04 Part-time Nonseasonal
144126 4 SC 0801-GENERAL ENGINEERING DP-05 Part-time Nonseasonal
144158 2 SC 0830-MECHANICAL ENGINEERING NM-03 Full-time Nonseasonal
144164 2 SA 1103-INDUSTRIAL PROPERTY MANAGEMENT DA-04 Full-time Nonseasonal
144282 3 SJ 2299-INFORMATION TECHNOLOGY STUDENT TRAINEE DS-01 Full-time Nonseasonal
144314 3 SJ 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE DP-03 Full-time Nonseasonal
144465 3 SC 1199-BUSINESS AND INDUSTRY STUDENT TRAINEE DG-01 Full-time Nonseasonal
144579 2 SC 1320-CHEMISTRY DP-02 Part-time Nonseasonal
144679 3 SD 0303-MISCELLANEOUS CLERK AND ASSISTANT DG-06 Full-time Nonseasonal
144866 1 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE DT-01 Part-time Nonseasonal
144902 4 SJ 1710-EDUCATION AND VOCATIONAL TRAINING AD-01 Part-time Nonseasonal
145001 4 SJ 1710-EDUCATION AND VOCATIONAL TRAINING AD-03 Part-time Nonseasonal
145028 3 SC 1710-EDUCATION AND VOCATIONAL TRAINING AD-01 Full-time Seasonal
145194 1 SC 0006-CORRECTIONAL INSTITUTION ADMINISTRATION GS-13 Part-time Nonseasonal
146139 1 SD 6641-ORDNANCE EQUIPMENT MECHANIC WG-12 Full-time Nonseasonal
146276 2 SD 1515-OPERATIONS RESEARCH ND-05 Part-time Nonseasonal
146408 4 SJ 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GS-04 Full-time Seasonal
146441 4 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE NT-01 Full-time Nonseasonal
146517 4 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GS-04 Full-time Seasonal
146526 3 SD 4301-MISCELLANEOUS PLIABLE MATERIALS WORK WG-11 Full-time Nonseasonal
146559 2 SC 0415-TOXICOLOGY ND-04 Part-time Nonseasonal
146714 4 SK 1102-CONTRACTING NT-05 Part-time Nonseasonal
146905 4 SC 1222-PATENT ATTORNEY NT-05 Full-time Nonseasonal
146968 2 SC 1515-OPERATIONS RESEARCH ND-05 Part-time Nonseasonal
147275 3 SD 1521-MATHEMATICS TECHNICIAN GS-12 Full-time Nonseasonal
147378 4 SF 3414-MACHINING WG-04 Full-time Nonseasonal
147609 2 SD 5876-ELECTROMOTIVE EQUIPMENT MECHANIC WG-11 Full-time Nonseasonal
148473 3 SC 0804-FIRE PROTECTION ENGINEERING GS-13 Part-time Nonseasonal
148595 3 SD 5407-ELECTRICAL POWER CONTROLLING WS-11 Full-time Nonseasonal
148604 4 SA 0021-COMMUNITY PLANNING TECHNICIAN GS-04 Full-time Nonseasonal
149482 2 SI 5409-WATER TREATMENT PLANT OPERATING WS-11 Full-time Nonseasonal
149539 4 SA 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GG-07 Full-time Nonseasonal
149587 4 SC 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE GS-03 Full-time Seasonal
149648 4 SD 1082-WRITING AND EDITING GS-08 Full-time Nonseasonal
149876 3 SD 5413-FUEL DISTRIBUTION SYSTEM OPERATING WL-09 Full-time Nonseasonal
150646 3 SJ 0801-GENERAL ENGINEERING EE-00 Full-time Nonseasonal
150728 1 SJ 9936-ENGINE MIDSHIPMAN WM-21 Full-time Nonseasonal
150901 1 SJ 9917-DECK MIDSHIPMAN WM-21 Full-time Nonseasonal
151029 2 SJ 9917-DECK MIDSHIPMAN WM-21 Full-time Nonseasonal
151039 2 SJ 9936-ENGINE MIDSHIPMAN WM-21 Full-time Nonseasonal
151282 1 SD 2131-FREIGHT RATE NG-03 Full-time Nonseasonal
151366 1 SA 0260-EQUAL EMPLOYMENT OPPORTUNITY DP-04 Full-time Nonseasonal
151369 1 SI 0361-EQUAL OPPORTUNITY ASSISTANCE NG-02 Full-time Nonseasonal
151387 3 SC 2299-INFORMATION TECHNOLOGY STUDENT TRAINEE NO-03 Full-time Nonseasonal
151397 4 SC 1199-BUSINESS AND INDUSTRY STUDENT TRAINEE NG-01 Part-time Nonseasonal
151625 3 SD 0510-ACCOUNTING NO-06 Full-time Nonseasonal
153743 2 SD 4373-MOLDING WD-06 Full-time Nonseasonal
154266 1 SD 3802-METAL FORGING WL-10 Full-time Nonseasonal
154284 3 SC 3359-INSTRUMENT MECHANIC WG-01 Full-time Seasonal
154346 3 SC 0871-NAVAL ARCHITECTURE GS-12 Part-time Nonseasonal
154627 3 SC 3801-MISCELLANEOUS METAL WORK WG-03 Full-time Seasonal
154637 3 SA 3806-SHEET METAL MECHANIC WG-05 Full-time Seasonal
155027 2 SD 3401-MISCELLANEOUS MACHINE TOOL WORK WS-15 Full-time Nonseasonal
155347 1 SC 3414-MACHINING WG-08 Full-time Seasonal
156304 3 SC 0086-SECURITY CLERICAL AND ASSISTANCE FP-06 Full-time Nonseasonal
156307 4 SD 2130-TRAFFIC MANAGEMENT FP-02 Full-time Nonseasonal
156310 4 SC 1087-EDITORIAL ASSISTANCE FP-07 Full-time Nonseasonal
156321 2 SA 0905-GENERAL ATTORNEY FP-04 Full-time Nonseasonal
156355 3 SA 0510-ACCOUNTING FP-03 Full-time Nonseasonal
156360 2 SI 0303-MISCELLANEOUS CLERK AND ASSISTANT FP-07 Part-time Nonseasonal
156361 3 SC 0303-MISCELLANEOUS CLERK AND ASSISTANT FP-09 Part-time Nonseasonal
156368 4 SC 0669-MEDICAL RECORDS ADMINISTRATION FP-05 Full-time Nonseasonal
156374 2 SJ 0303-MISCELLANEOUS CLERK AND ASSISTANT FP-08 Full-time Nonseasonal
156382 4 SC 1702-EDUCATION AND TRAINING TECHNICIAN FP-06 Full-time Nonseasonal
156448 2 SC 0303-MISCELLANEOUS CLERK AND ASSISTANT FP-08 Full-time Nonseasonal
156454 1 SC 1750-INSTRUCTIONAL SYSTEMS FP-04 Full-time Nonseasonal
156479 4 SA 0260-EQUAL EMPLOYMENT OPPORTUNITY FP-04 Full-time Nonseasonal
156484 1 SA 0201-HUMAN RESOURCES MANAGEMENT FP-02 Part-time Nonseasonal
157610 4 SC 2210-INFORMATION TECHNOLOGY MANAGEMENT SK-16 Part-time Nonseasonal
157665 2 SD 1410-LIBRARIAN SK-15 Full-time Nonseasonal
157751 4 SC 0201-HUMAN RESOURCES MANAGEMENT SK-13 Part-time Nonseasonal
157761 4 SC 1499-LIBRARY AND ARCHIVES STUDENT TRAINEE SK-07 Full-time Nonseasonal
157764 4 SD 0340-PROGRAM MANAGEMENT SK-16 Full-time Nonseasonal
157782 1 SJ 0950-PARALEGAL SPECIALIST SK-07 Full-time Nonseasonal
157790 2 SC 1750-INSTRUCTIONAL SYSTEMS SK-16 Full-time Nonseasonal
157794 2 SC 1410-LIBRARIAN SK-09 Full-time Nonseasonal
157795 3 SI 0080-SECURITY ADMINISTRATION SK-17 Full-time Nonseasonal
157798 2 SK 0201-HUMAN RESOURCES MANAGEMENT SK-14 Part-time Nonseasonal
157823 3 SC 2210-INFORMATION TECHNOLOGY MANAGEMENT SO-01 Full-time Nonseasonal
157834 4 SJ 0501-FINANCIAL ADMINISTRATION AND PROGRAM SK-13 Part-time Nonseasonal
158130 3 SC 0804-FIRE PROTECTION ENGINEERING GS-07 Full-time Nonseasonal
158187 2 SC 5701-MISC TRANSPORTATION/MOBILE EQUIPMENT OPER WL-02 Full-time Nonseasonal
158375 1 SF 0356-DATA TRANSCRIBER GS-07 Full-time Nonseasonal
158552 3 SJ 0130-FOREIGN AFFAIRS GS-15 Intermittent Nonseasonal
158577 3 SC 0130-FOREIGN AFFAIRS AD-00 Full-time Nonseasonal
158579 3 SJ 2032-PACKAGING GS-12 Intermittent Nonseasonal
158583 3 SA 0130-FOREIGN AFFAIRS GG-14 Full-time Nonseasonal
158587 3 SC 0130-FOREIGN AFFAIRS EF-15 Intermittent Nonseasonal
158599 3 SJ 0130-FOREIGN AFFAIRS EF-14 Intermittent Nonseasonal
158622 3 SJ 0080-SECURITY ADMINISTRATION GS-14 Intermittent Nonseasonal
158654 3 SJ 0130-FOREIGN AFFAIRS GS-14 Intermittent Nonseasonal
158692 2 SJ 0318-SECRETARY GS-10 Intermittent Nonseasonal
158775 2 SD 1008-INTERIOR DESIGN GS-15 Full-time Nonseasonal
158847 3 SJ 0391-TELECOMMUNICATIONS GS-11 Intermittent Nonseasonal
158884 3 SD 0150-GEOGRAPHY ES-** Full-time Nonseasonal
158987 2 SD 0132-INTELLIGENCE GM-13 Full-time Nonseasonal
159027 3 SC 0130-FOREIGN AFFAIRS GS-14 Intermittent Nonseasonal
159029 3 SC 1109-GRANTS MANAGEMENT AD-05 Full-time Nonseasonal
159042 3 SJ 1035-PUBLIC AFFAIRS AD-05 Full-time Nonseasonal
159044 3 SC 0306-GOVERNMENT INFORMATION SPECIALIST GS-09 Part-time Nonseasonal
159045 3 SJ 0130-FOREIGN AFFAIRS EF-15 Intermittent Nonseasonal
159103 3 SC 0130-FOREIGN AFFAIRS ED-15 Intermittent Nonseasonal
159147 3 SK 0130-FOREIGN AFFAIRS GS-14 Intermittent Nonseasonal
159217 1 SC 0130-FOREIGN AFFAIRS EF-15 Full-time Nonseasonal
159782 3 SC 0905-GENERAL ATTORNEY AA-06 Intermittent Nonseasonal
159803 2 SJ 0901-GENERAL LEGAL AND KINDRED ADMINISTRATION GS-09 Intermittent Nonseasonal
160700 3 SC 0260-EQUAL EMPLOYMENT OPPORTUNITY GS-15 Intermittent Nonseasonal
160723 3 SJ 0905-GENERAL ATTORNEY GS-12 Intermittent Nonseasonal
161612 2 SC 0998-CLAIMS ASSISTANCE AND EXAMINING GS-07 Intermittent Nonseasonal
164022 2 SI 0105-SOCIAL INSURANCE ADMINISTRATION GS-06 Full-time Nonseasonal
164528 1 SD 0160-CIVIL RIGHTS ANALYSIS ES-** Full-time Nonseasonal
164537 3 SC 0020-COMMUNITY PLANNING GS-12 Intermittent Nonseasonal
164593 3 SC 1499-LIBRARY AND ARCHIVES STUDENT TRAINEE GS-07 Part-time Nonseasonal
164751 3 SC 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM FJ-00 Full-time Nonseasonal
165005 3 SD 0413-PHYSIOLOGY FV-J Full-time Nonseasonal
165047 4 SD 1825-AVIATION SAFETY EV-02 Full-time Nonseasonal
165056 1 SD 0346-LOGISTICS MANAGEMENT FV-G Part-time Nonseasonal
165207 4 SI 0861-AEROSPACE ENGINEERING FG-13 Full-time Nonseasonal
165242 3 SD 1071-AUDIOVISUAL PRODUCTION FV-G Full-time Nonseasonal
165440 4 SD 0601-GENERAL HEALTH SCIENCE FV-G Full-time Nonseasonal
165681 4 SC 0401-GENERAL NATURAL RESOURCES MANAGEMENT AND ... FV-G Full-time Nonseasonal
165694 4 SK 0343-MANAGEMENT AND PROGRAM ANALYSIS FG-15 Full-time Nonseasonal
165819 3 SC 0343-MANAGEMENT AND PROGRAM ANALYSIS FG-07 Full-time Nonseasonal
166216 2 SJ 0899-ENGINEERING AND ARCHITECTURE STUDENT TRAINEE FV-C Part-time Nonseasonal
166649 2 SD 0401-GENERAL NATURAL RESOURCES MANAGEMENT AND ... FV-I Full-time Nonseasonal
167164 3 SD 2010-INVENTORY MANAGEMENT FG-09 Full-time Nonseasonal
167232 1 SC 0675-MEDICAL RECORDS TECHNICIAN FV-G Full-time Nonseasonal
167765 4 SA 0810-CIVIL ENGINEERING GS-07 Full-time Seasonal
167983 4 SC 0090-GUIDE GS-01 Full-time Nonseasonal
167984 4 SJ 0090-GUIDE GS-01 Full-time Nonseasonal
167990 3 SD 5786-SMALL CRAFT OPERATING WL-12 Full-time Seasonal
168609 1 SA 1520-MATHEMATICS OR-51 Full-time Nonseasonal
169045 4 SC 0356-DATA TRANSCRIBER GS-03 Full-time Seasonal
169402 4 SJ 0356-DATA TRANSCRIBER GS-03 Part-time Nonseasonal
169897 3 SD 0341-ADMINISTRATIVE OFFICER IR-SM Full-time Nonseasonal
170587 3 SJ 0303-MISCELLANEOUS CLERK AND ASSISTANT GS-02 Full-time Seasonal
170682 4 SA 0501-FINANCIAL ADMINISTRATION AND PROGRAM GS-05 Full-time Seasonal
171044 3 SC 0303-MISCELLANEOUS CLERK AND ASSISTANT GS-02 Full-time Seasonal
172568 3 SJ 2005-SUPPLY CLERICAL AND TECHNICIAN GS-04 Intermittent Nonseasonal
172848 1 SI 0356-DATA TRANSCRIBER GS-03 Full-time Seasonal
173512 2 SD 1101-GENERAL BUSINESS AND INDUSTRY GS-09 Part-time Job Sharer Nonseasonal
174326 4 SC 0592-TAX EXAMINING GS-05 Part-time Seasonal
175009 2 SD 1397-DOCUMENT ANALYSIS IR-FM Full-time Nonseasonal
175175 2 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... GS-06 Full-time Seasonal
176572 3 SC 0501-FINANCIAL ADMINISTRATION AND PROGRAM GS-09 Part-time Seasonal
176724 4 SC 0356-DATA TRANSCRIBER GS-02 Full-time Seasonal
177579 1 SJ 0512-INTERNAL REVENUE AGENT GS-13 Intermittent Nonseasonal
178189 1 SC 0356-DATA TRANSCRIBER GS-02 Full-time Seasonal
178500 4 SI 3869-METAL FORMING MACHINE OPERATING WG-06 Intermittent Nonseasonal
178757 1 SD 0511-AUDITING NB-07 Full-time Nonseasonal
178763 4 SC 0399-ADMINISTRATION AND OFFICE SUPPORT STUDENT... NB-03 Full-time Nonseasonal
178811 3 SC 1160-FINANCIAL ANALYSIS NB-06 Part-time Nonseasonal
178820 4 SC 0110-ECONOMIST NB-06 Part-time Nonseasonal
178862 1 SD 0986-LEGAL ASSISTANCE NB-04 Full-time Nonseasonal
178979 1 SA 0303-MISCELLANEOUS CLERK AND ASSISTANT NB-02 Part-time Nonseasonal
179028 1 SC 0570-FINANCIAL INSTITUTION EXAMINING NB-06 Intermittent Nonseasonal
179323 4 SC 0511-AUDITING ES-** Intermittent Nonseasonal
180727 4 SC 0996-VETERANS CLAIMS EXAMINING GS-10 Part-time Nonseasonal
182096 1 SJ 0503-FINANCIAL CLERICAL AND ASSISTANCE GS-06 Intermittent Nonseasonal
182785 1 SJ 0621-NURSING ASSISTANT AD-00 Full-time Nonseasonal
183419 1 SJ 4102-PAINTING WB-00 Intermittent Nonseasonal
184918 1 SC 0661-PHARMACY TECHNICIAN GS-02 Intermittent Nonseasonal
185606 1 SC 0187-SOCIAL SERVICES GS-07 Intermittent Nonseasonal
185612 1 SC 7305-LAUNDRY MACHINE OPERATING WL-04 Full-time Nonseasonal
185619 4 SI 0102-SOCIAL SCIENCE AID AND TECHNICIAN GS-02 Part-time Nonseasonal
186470 3 SI 0661-PHARMACY TECHNICIAN GS-02 Part-time Nonseasonal
187280 2 SC 0625-AUTOPSY ASSISTANT GS-04 Intermittent Nonseasonal
188209 3 SC 7305-LAUNDRY MACHINE OPERATING WG-04 Part-time Nonseasonal
188290 3 SC 0083-POLICE GS-07 Intermittent Nonseasonal
189645 1 SD 1601-EQUIPMENT FACILITIES, AND SERVICES GS-11 Intermittent Nonseasonal
190784 3 SC 0180-PSYCHOLOGY AD-00 Intermittent Nonseasonal
191199 3 SJ 0199-SOCIAL SCIENCE STUDENT TRAINEE AD-00 Part-time Nonseasonal
193872 4 SJ 0185-SOCIAL WORK GS-02 Intermittent Nonseasonal
194936 2 SJ 1199-BUSINESS AND INDUSTRY STUDENT TRAINEE GS-02 Intermittent Nonseasonal
195894 1 SJ 0401-GENERAL NATURAL RESOURCES MANAGEMENT AND ... GS-10 Intermittent Nonseasonal
195895 1 SC 0299-HUMAN RESOURCES MANAGEMENT STUDENT TRAINEE GS-09 Part-time Nonseasonal
196227 4 SD 7301-MISC LAUNDRY, DRY CLEANING, AND PRESSING WS-08 Full-time Nonseasonal
196320 2 SC 0681-DENTAL ASSISTANT GS-08 Part-time Nonseasonal
196786 1 SC 0683-DENTAL LABORATORY AID AND TECHNICIAN GS-05 Part-time Nonseasonal
198299 4 SC 0699-MEDICAL AND HEALTH STUDENT TRAINEE GS-02 Full-time Nonseasonal
198505 4 SC 0530-CASH PROCESSING GS-05 Part-time Nonseasonal
199799 4 SC 0181-PSYCHOLOGY AID AND TECHNICIAN GS-04 Intermittent Nonseasonal
200273 2 SJ 0335-COMPUTER CLERK AND ASSISTANT GS-03 Intermittent Nonseasonal
200845 4 SJ 0525-ACCOUNTING TECHNICIAN GS-04 Intermittent Nonseasonal
201133 1 SC 7404-COOKING WG-07 Full-time Nonseasonal
201186 1 SD 0639-EDUCATIONAL THERAPIST GS-09 Full-time Nonseasonal
202022 4 SJ 0601-GENERAL HEALTH SCIENCE GS-05 Intermittent Nonseasonal
202948 2 SD 0637-MANUAL ARTS THERAPIST GS-11 Full-time Nonseasonal
203694 2 SC 0530-CASH PROCESSING VC-02 Intermittent Nonseasonal
203728 2 SC 5703-MOTOR VEHICLE OPERATING WG-02 Part-time Nonseasonal
203929 3 SC 0669-MEDICAL RECORDS ADMINISTRATION GS-11 Intermittent Nonseasonal
204060 2 SD 6901-MISC WAREHOUSING AND STOCK HANDLING WD-07 Full-time Nonseasonal
204563 2 SJ 0644-MEDICAL TECHNOLOGIST GS-07 Intermittent Nonseasonal
206370 1 SJ 0670-HEALTH SYSTEM ADMINISTRATION AD-00 Intermittent Nonseasonal
207091 1 SJ 0605-NURSE ANESTHETIST (TITLE 38) AD-00 Part-time Nonseasonal
210811 3 SD 4010-PRESCRIPTION EYEGLASS MAKING WL-09 Full-time Nonseasonal
211162 3 SC 0699-MEDICAL AND HEALTH STUDENT TRAINEE GS-06 Full-time Nonseasonal
211308 1 SC 0645-MEDICAL TECHNICIAN GS-01 Intermittent Nonseasonal
214123 1 SI 5406-UTILITY SYSTEMS OPERATING WS-12 Full-time Nonseasonal

853 rows × 5 columns

These 1293 separation observations do not have coverage within the EMP Dataset, thus, we will remove these observations as out of scope demographic in our analysis. Any attempt in predicting these values will not have enough data to support a significant response.

In [15]:
OPMDataMerged = OPMDataMerged[OPMDataMerged["IndAvgSalary"].notnull()]

print(len(OPMDataMerged[OPMDataMerged["IndAvgSalary"].isnull()]))

print(len(OPMDataMerged))
0
8170907


Placeholder Chunks for Data Quality check of salary against GS Grade Level Ranges



In [16]:
# Placeholder Chunks for Data Quality check of salary against GS Grade Level Ranges
In [ ]:
 

We are iterested to see how federal pension plans may impact attrition in this dataset. An interesting attribute to complement Length of service, is Years to Retirement. Utilizing a FERS retirement eligibility baseline of 57 years of age for all observations, and the lower limitation of age level ranges we compute a numeric value for length of retirement.

In [17]:
#Add Column YearsToRetirement

"""
    AGELVL,AGELVLT
    A,Less than 20
    B,20-24
    C,25-29
    D,30-34
    E,35-39
    F,40-44
    G,45-49
    H,50-54
    I,55-59
    J,60-64
    K,65 or more
    Z,Unspecified
"""
OPMDataMerged["LowerLimitAge"] = np.where(OPMDataMerged["AGELVL"]=="B", 20,
                                                np.where(OPMDataMerged["AGELVL"]=="C", 25,
                                                         np.where(OPMDataMerged["AGELVL"]=="D", 30,
                                                                  np.where(OPMDataMerged["AGELVL"]=="E", 35,
                                                                           np.where(OPMDataMerged["AGELVL"]=="F", 40,
                                                                                    np.where(OPMDataMerged["AGELVL"]=="G", 45,
                                                                                             np.where(OPMDataMerged["AGELVL"]=="H", 50,
                                                                                                      np.where(OPMDataMerged["AGELVL"]=="I", 55,
                                                                                                               np.where(OPMDataMerged["AGELVL"]=="J", 60,
                                                                                                                        np.where(OPMDataMerged["AGELVL"]=="K", 65,
                                                                                                                                 np.nan
                                                                                                                                )
                                                                                                                        )
                                                                                                               )
                                                                                                      )
                                                                                            )
                                                                                   )
                                                                          )
                                                                 )
                                                        )
                                               )  

retAge = 57

OPMDataMerged["YearsToRetirement"] = np.where(OPMDataMerged["AGELVL"]=="B", retAge-20,
                                                np.where(OPMDataMerged["AGELVL"]=="C", retAge-25,
                                                         np.where(OPMDataMerged["AGELVL"]=="D", retAge-30,
                                                                  np.where(OPMDataMerged["AGELVL"]=="E", retAge-35,
                                                                           np.where(OPMDataMerged["AGELVL"]=="F", retAge-40,
                                                                                    np.where(OPMDataMerged["AGELVL"]=="G", retAge-45,
                                                                                             np.where(OPMDataMerged["AGELVL"]=="H", retAge-50,
                                                                                                      np.where(OPMDataMerged["AGELVL"]=="I", retAge-55,
                                                                                                               np.where(OPMDataMerged["AGELVL"]=="J", retAge-60,
                                                                                                                        np.where(OPMDataMerged["AGELVL"]=="K", retAge-65,
                                                                                                                                 np.nan
                                                                                                                                )
                                                                                                                        )
                                                                                                               )
                                                                                                      )
                                                                                            )
                                                                                   )
                                                                          )
                                                                 )
                                                        )
                                               )  

print("Null Values for LowerLimitAge: " + str(len(OPMDataMerged[OPMDataMerged["LowerLimitAge"].isnull()])))
print("Null Values for YearsToRetirement: " + str(len(OPMDataMerged[OPMDataMerged["YearsToRetirement"].isnull()])))

display(OPMDataMerged.head())
display(OPMDataMerged.tail())
Null Values for LowerLimitAge: 0
Null Values for YearsToRetirement: 0
AGYSUB SEP DATECODE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement
0 AA00 SC 201507 C M 11 A 11 0905 1 GS-11 F 40 F 1.0 63722.0 0.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 4 25-29 Less than 1 year 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 2 Non-permanent 40-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 205.0 1319 64540.593830 -818.593830 25.0 32.0
1 AA00 SC 201506 D F 15 C 11 0905 1 GS-15 L 30 F 1.0 126245.0 4.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 3 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $120,000 - $129,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 207.0 1132 149864.298504 -23619.298504 30.0 27.0
2 AF** SA 201503 H M 11 C 48 2210 2 GS-11 F 10 F 1.0 66585.0 4.9 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF**-INVALID 2 50-54 3 - 4 years 1 United States 48-TEXAS 1 White Collar 22 22xx-INFORMATION TECHNOLOGY 2210-INFORMATION TECHNOLOGY MANAGEMENT Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 439.0 1087 71530.963755 -4945.963755 50.0 7.0
3 AF02 SD 201506 I M 15 J 35 0301 2 GS-15 O 10 F 1.0 156737.0 39.8 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF02-AIR FORCE INSPECTION AGENCY (FO) 3 55-59 35 years or more 1 United States 35-NEW MEXICO 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $150,000 - $159,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 670.0 265 146735.220304 10001.779696 55.0 2.0
4 AF03 SC 201509 H M 13 B 06 0301 2 GS-13 I 15 F 1.0 92973.0 1.0 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF03-AIR FORCE OPERATIONAL TEST AND EVALUATION... 4 50-54 1 - 2 years 1 United States 06-CALIFORNIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $90,000 - $99,999 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 721.0 1853 101641.124025 -8668.124025 50.0 7.0
AGYSUB SEP DATECODE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement
8223188 ZU00 NS 201509 D NaN NaN C 11 0301 2 AD-00 G 48 F NaN 76377.0 4.8 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $70,000 - $79,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 -39463.182250 30.0 27.0
8223189 ZU00 NS 201509 K NaN NaN D 11 0301 2 AD-00 M 48 F NaN 139517.0 7.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 65 or more 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $130,000 - $139,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 23676.817750 65.0 -8.0
8223190 ZU00 NS 201509 K NaN NaN D 11 0301 2 AD-00 O 48 F NaN 158671.0 7.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 65 or more 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $150,000 - $159,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 42830.817750 65.0 -8.0
8223191 ZU00 NS 201509 B NaN NaN B 11 0301 2 AD-00 C 48 F NaN 36244.0 1.6 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 20-24 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $30,000 - $39,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 -79596.182250 20.0 37.0
8223192 ZU00 NS 201509 E NaN NaN D 11 0505 2 AD-00 I 48 F NaN 99288.0 5.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 35-39 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 05 05xx-ACCOUNTING AND BUDGET 0505-FINANCIAL MANAGEMENT Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $90,000 - $99,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 7.0 1391 148382.833333 -49094.833333 35.0 22.0

Pull Bureau of Labor Statistics data

In addition to the OPM data, we merge 10 attributes from the Bureau of Labor Statistics (BLS). Data is sourced from Federal Government industry codes across all regions. Although assumed to be highly correlated, we source both Level (Total number) and Rate (Percentage of Level to total employment and / or job openings) for the following statistics: 1) Job Openings, 2) Layoffs, 3) Quits, 4) Total Separations, and 5) Other Separations. While Rate paints an aggregated, holistic picture for job market trends, Level provides a raw count for total separations alone. Both these statistics were captured by a monthly aggregate and merged to the OPM data by their respective months.

In [18]:
%%time

def bls(series, start, end):
    headers = {'Content-type': 'application/json'}
    sID   = []
    
    for i in range(0,len(series)):
        sID.append(series[i][0])
    
    data = json.dumps({"seriesid": sID,
                       "startyear":start,
                       "endyear":end,
                       "catalog":False,
                       "calculations":False,
                       "annualaverage":False,
                       "registrationkey":"7a89c8d7979349fba8914b8be16a1646"})
    
    p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    json_data = json.loads(p.text)
    bls = []
    for series in json_data['Results']['series']:
        #x=prettytable.PrettyTable(["series id","year","period","value","footnotes"])
        result = pd.DataFrame(columns=["series id","year","period","value","footnotes"])
        seriesId = series['seriesID']
        for item in series['data']:
            year = item['year']
            period = item['period']
            value = item['value']
            footnotes=""
            for footnote in item['footnotes']:
                if footnote:
                    footnotes = footnotes + footnote['text'] + ','
            if 'M01' <= period <= 'M12':
                #x.add_row([seriesId,year,period,value,footnotes[0:-1]])
                y = pd.DataFrame({"series id" : seriesId,
                                  "year" : year,
                                  "period" : period,
                                  "value" : value,
                                  "footnotes" : footnotes}, index = [0])
                result = result.append(y, ignore_index = True)
        bls.append(result)
    return(bls)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs
In [19]:
%%time

seriesList = [
              ['JTU91000000JOL','BLS_FEDERAL_JobOpenings_Level'],
              ['JTU91000000LDL','BLS_FEDERAL_Layoffs_Level'],
              ['JTU91000000OSL','BLS_FEDERAL_OtherSep_Level'],
              ['JTU91000000QUL','BLS_FEDERAL_Quits_Level'],
              ['JTU91000000TSL','BLS_FEDERAL_TotalSep_Level'],
              ['JTU91000000JOR','BLS_FEDERAL_JobOpenings_Rate'],
              ['JTU91000000LDR','BLS_FEDERAL_Layoffs_Rate'],
              ['JTU91000000OSR','BLS_FEDERAL_OtherSep_Rate'],
              ['JTU91000000QUR','BLS_FEDERAL_Quits_Rate'],
              ['JTU91000000TSR','BLS_FEDERAL_TotalSep_Rate']
             ]

# Pull job openings and labor turnover data
JTL = bls(seriesList, "2014", "2015")

seriesList = pd.DataFrame(seriesList, columns = ["series id","sName"])

##We need to replace these with actual Descriptor Column Names

for i in range(0,len(seriesList)):
    
    JTL[i] = JTL[i].merge(seriesList, on = "series id", how = 'inner')

    if len(JTL[i]) >0:
        name = JTL[i]["sName"].drop_duplicates().values[0]
    else:
        name = str(i)

    JTL[i][name] = JTL[i]["value"].apply(pd.to_numeric)
    JTL[i]["DATECODE"] = JTL[i]["year"] + JTL[i]["period"].str[-2:]
    del JTL[i]["value"]
    del JTL[i]["year"]
    del JTL[i]["period"]
    del JTL[i]["series id"]
    del JTL[i]["footnotes"]
    del JTL[i]["sName"]
    
    
    OPMDataMerged = OPMDataMerged.merge(JTL[i], on = "DATECODE", how = 'left')
    display(JTL[i].head())
    
BLS_FEDERAL_OtherSep_Rate DATECODE
0 0.4 201512
1 0.4 201511
2 0.4 201510
3 0.4 201509
4 0.5 201508
BLS_FEDERAL_Quits_Rate DATECODE
0 0.4 201512
1 0.4 201511
2 0.6 201510
3 0.5 201509
4 0.6 201508
BLS_FEDERAL_TotalSep_Level DATECODE
0 37 201512
1 35 201511
2 45 201510
3 38 201509
4 41 201508
BLS_FEDERAL_JobOpenings_Rate DATECODE
0 2.9 201512
1 2.6 201511
2 2.4 201510
3 1.9 201509
4 2.3 201508
BLS_FEDERAL_OtherSep_Level DATECODE
0 12 201512
1 10 201511
2 12 201510
3 12 201509
4 14 201508
BLS_FEDERAL_Quits_Level DATECODE
0 11 201512
1 10 201511
2 16 201510
3 14 201509
4 17 201508
BLS_FEDERAL_JobOpenings_Level DATECODE
0 83 201512
1 73 201511
2 68 201510
3 55 201509
4 67 201508
BLS_FEDERAL_Layoffs_Rate DATECODE
0 0.5 201512
1 0.6 201511
2 0.6 201510
3 0.4 201509
4 0.3 201508
BLS_FEDERAL_Layoffs_Level DATECODE
0 15 201512
1 15 201511
2 18 201510
3 12 201509
4 10 201508
BLS_FEDERAL_TotalSep_Rate DATECODE
0 1.3 201512
1 1.3 201511
2 1.6 201510
3 1.4 201509
4 1.5 201508
CPU times: user 39.4 s, sys: 8.63 s, total: 48 s
Wall time: 48.5 s
In [20]:
display(OPMDataMerged.head())
display(OPMDataMerged.tail())
AGYSUB SEP DATECODE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate
0 AA00 SC 201507 C M 11 A 11 0905 1 GS-11 F 40 F 1.0 63722.0 0.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 4 25-29 Less than 1 year 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 2 Non-permanent 40-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 205.0 1319 64540.593830 -818.593830 25.0 32.0 0.4 0.5 34 2.6 11 13 74 0.4 10 1.2
1 AA00 SC 201506 D F 15 C 11 0905 1 GS-15 L 30 F 1.0 126245.0 4.8 4 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES AA00-ADMINISTRATIVE CONFERENCE OF THE UNITED S... 3 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $120,000 - $129,999 1 Permanent 30-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 207.0 1132 149864.298504 -23619.298504 30.0 27.0 0.4 0.5 34 2.3 12 13 65 0.4 10 1.2
2 AF** SA 201503 H M 11 C 48 2210 2 GS-11 F 10 F 1.0 66585.0 4.9 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF**-INVALID 2 50-54 3 - 4 years 1 United States 48-TEXAS 1 White Collar 22 22xx-INFORMATION TECHNOLOGY 2210-INFORMATION TECHNOLOGY MANAGEMENT Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $60,000 - $69,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 439.0 1087 71530.963755 -4945.963755 50.0 7.0 0.3 0.4 31 3.0 9 10 86 0.5 12 1.1
3 AF02 SD 201506 I M 15 J 35 0301 2 GS-15 O 10 F 1.0 156737.0 39.8 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF02-AIR FORCE INSPECTION AGENCY (FO) 3 55-59 35 years or more 1 United States 35-NEW MEXICO 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $150,000 - $159,999 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 670.0 265 146735.220304 10001.779696 55.0 2.0 0.4 0.5 34 2.3 12 13 65 0.4 10 1.2
4 AF03 SC 201509 H M 13 B 06 0301 2 GS-13 I 15 F 1.0 92973.0 1.0 1 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE AF03-AIR FORCE OPERATIONAL TEST AND EVALUATION... 4 50-54 1 - 2 years 1 United States 06-CALIFORNIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans GS GS-GENERAL SCHEDULE $90,000 - $99,999 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 721.0 1853 101641.124025 -8668.124025 50.0 7.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4
AGYSUB SEP DATECODE AGELVL GENDER GSEGRD LOSLVL LOC OCC PATCO PPGRD SALLVL TOA WORKSCH COUNT SALARY LOS AGYTYP AGYTYPT AGY AGYT AGYSUBT QTR AGELVLT LOSLVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT OCCT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT PAYPLAN PAYPLANT SALLVLT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate
8170902 ZU00 NS 201509 D NaN NaN C 11 0301 2 AD-00 G 48 F NaN 76377.0 4.8 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 30-34 3 - 4 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $70,000 - $79,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 -39463.182250 30.0 27.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4
8170903 ZU00 NS 201509 K NaN NaN D 11 0301 2 AD-00 M 48 F NaN 139517.0 7.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 65 or more 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $130,000 - $139,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 23676.817750 65.0 -8.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4
8170904 ZU00 NS 201509 K NaN NaN D 11 0301 2 AD-00 O 48 F NaN 158671.0 7.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 65 or more 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $150,000 - $159,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 42830.817750 65.0 -8.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4
8170905 ZU00 NS 201509 B NaN NaN B 11 0301 2 AD-00 C 48 F NaN 36244.0 1.6 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 20-24 1 - 2 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $30,000 - $39,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 721.0 1391 115840.182250 -79596.182250 20.0 37.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4
8170906 ZU00 NS 201509 E NaN NaN D 11 0505 2 AD-00 I 48 F NaN 99288.0 5.0 4 Small Independent Agencies (less than 100 empl... ZU ZU-DWIGHT D. EISENHOWER MEMORIAL COMMISSION ZU00-DWIGHT D. EISENHOWER MEMORIAL COMMISSION 4 35-39 5 - 9 years 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 05 05xx-ACCOUNTING AND BUDGET 0505-FINANCIAL MANAGEMENT Administrative 3 Other White Collar Pay Plans 31 Governmentwide or Multi-Agency Plans AD AD-ADMINISTRATIVELY DETERMINED RATES, NOT ELSE... $90,000 - $99,999 2 Non-permanent 48-Excepted Service - Other 1 Full-time Full-time Nonseasonal 7.0 1391 148382.833333 -49094.833333 35.0 22.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4
In [ ]:
 
In [21]:
display(pd.DataFrame({'StratCount' : OPMDataMerged.groupby(["SEP"]).size()}).reset_index())
SEP StratCount
0 NS 7957918
1 SA 26945
2 SB 333
3 SC 66248
4 SD 56820
5 SE 1260
6 SF 4100
7 SG 1467
8 SH 400
9 SI 9728
10 SJ 42754
11 SK 2892
12 SL 42

There are several separation types we would like to either roll up, or remove altogether.

Roll-Up

We have chosen to roll up all retirement separation together. Separation categories of 1) SD,Retirement - Voluntary; 2) SE,Retirement - Early Out; 3) SF,Retirement - Disability; 4) SG,Retirement - Other are consolidated into one category "SD".

Removal

We have chosen to remove the following. 1) SB,Transfer Out - Mass Transfer; 2) SK,Death; 3) SL,Other Separation. 4) SJ,Termination (Expired Appt/Other)

In [22]:
OPMDataMerged = OPMDataMerged[(OPMDataMerged["SEP"] != "SB") & (OPMDataMerged["SEP"] != "SK") & (OPMDataMerged["SEP"] != "SL") & (OPMDataMerged["SEP"] != "SJ")]

OPMDataMerged.loc[(OPMDataMerged["SEP"] == "SD") | (OPMDataMerged["SEP"] == "SE") | (OPMDataMerged["SEP"] == "SF") | (OPMDataMerged["SEP"] == "SG"), "SEP"]="SD"

Preliminary EDA

In terms of data exploration, we first investigate numeric type attributes. Relationships, distributions, and correlation values are reviewed.

A new binary separation attribute is created to indicate whether non-sep or sep for EDA correlation purposes

In [23]:
#%%time
#
#
#cols = list(SampledOPMData.select_dtypes(include=['float64', 'int64']))
#cols.remove('COUNT')
#cols.remove('BLS_FEDERAL_OtherSep_Rate')
#cols.remove('BLS_FEDERAL_Quits_Rate')
#cols.remove('BLS_FEDERAL_TotalSep_Level')
#cols.remove('BLS_FEDERAL_JobOpenings_Rate')
#cols.remove('BLS_FEDERAL_OtherSep_Level')
#cols.remove('BLS_FEDERAL_Quits_Level')
#cols.remove('BLS_FEDERAL_JobOpenings_Level')
#cols.remove('BLS_FEDERAL_Layoffs_Rate')
#cols.remove('BLS_FEDERAL_Layoffs_Level')
#cols.remove('BLS_FEDERAL_TotalSep_Rate')
#cols.append('SEP')
#display(cols)
#
#plotNumeric = SampledOPMData[cols]
#
## Create binary separation attribute for EDA correlation review
##plotNumeric["SEP_bin"] = plotNumeric.SEP.replace("NS", 1)
##plotNumeric.loc[plotNumeric['SEP_bin'] != 1, 'SEP_bin'] = 0
##plotNumeric.SEP_bin = plotNumeric.SEP_bin.apply(pd.to_numeric)
#AttSplit = pd.get_dummies(plotNumeric['SEP'],prefix='SEP')
#display(AttSplit.head())
#plotNumeric = pd.concat((plotNumeric,AttSplit),axis=1) # add back into the dataframe
#
#display(plotNumeric.head())
#print("plotNumeric has {0} Records".format(len(plotNumeric)))
##print(plotNumeric.SEP_bin.dtype)
In [24]:
%%time


cols = list(OPMDataMerged.select_dtypes(include=['float64', 'int64']))
cols.remove('COUNT')
cols.remove('BLS_FEDERAL_OtherSep_Rate')
cols.remove('BLS_FEDERAL_Quits_Rate')
cols.remove('BLS_FEDERAL_TotalSep_Level')
cols.remove('BLS_FEDERAL_JobOpenings_Rate')
cols.remove('BLS_FEDERAL_OtherSep_Level')
cols.remove('BLS_FEDERAL_Quits_Level')
cols.remove('BLS_FEDERAL_JobOpenings_Level')
cols.remove('BLS_FEDERAL_Layoffs_Rate')
cols.remove('BLS_FEDERAL_Layoffs_Level')
cols.remove('BLS_FEDERAL_TotalSep_Rate')
cols.append('SEP')
display(cols)

plotNumeric = OPMDataMerged[cols]

# Create binary separation attribute for EDA correlation review
#plotNumeric["SEP_bin"] = plotNumeric.SEP.replace("NS", 1)
#plotNumeric.loc[plotNumeric['SEP_bin'] != 1, 'SEP_bin'] = 0
#plotNumeric.SEP_bin = plotNumeric.SEP_bin.apply(pd.to_numeric)
AttSplit = pd.get_dummies(plotNumeric['SEP'],prefix='SEP')
display(AttSplit.head())
plotNumeric = pd.concat((plotNumeric,AttSplit),axis=1) # add back into the dataframe

display(plotNumeric.head())
print("plotNumeric has {0} Records".format(len(plotNumeric)))
#print(plotNumeric.SEP_bin.dtype)
['SALARY',
 'LOS',
 'SEPCount_EFDATE_OCC',
 'SEPCount_EFDATE_LOC',
 'IndAvgSalary',
 'SalaryOverUnderIndAvg',
 'LowerLimitAge',
 'YearsToRetirement',
 'SEP']
SEP_NS SEP_SA SEP_SC SEP_SD SEP_SH SEP_SI
0 0 0 1 0 0 0
1 0 0 1 0 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
SALARY LOS SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement SEP SEP_NS SEP_SA SEP_SC SEP_SD SEP_SH SEP_SI
0 63722.0 0.8 205.0 1319 64540.593830 -818.593830 25.0 32.0 SC 0 0 1 0 0 0
1 126245.0 4.8 207.0 1132 149864.298504 -23619.298504 30.0 27.0 SC 0 0 1 0 0 0
2 66585.0 4.9 439.0 1087 71530.963755 -4945.963755 50.0 7.0 SA 0 1 0 0 0 0
3 156737.0 39.8 670.0 265 146735.220304 10001.779696 55.0 2.0 SD 0 0 0 1 0 0
4 92973.0 1.0 721.0 1853 101641.124025 -8668.124025 50.0 7.0 SC 0 0 1 0 0 0
plotNumeric has 8124886 Records
CPU times: user 996 ms, sys: 434 ms, total: 1.43 s
Wall time: 1.42 s
In [25]:
%%time

sns.set(font_scale=1)
sns.pairplot(plotNumeric.drop(['SEP_NS',
                               'SEP_SA',
                               'SEP_SC',
                               'SEP_SD',
                               'SEP_SH', 
                               'SEP_SI'], axis=1), hue = 'SEP', palette="hls", plot_kws={"s": 50})
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 37min 43s, sys: 1min 17s, total: 39min
Wall time: 37min 44s
In [26]:
%%time

# Function modified from https://stackoverflow.com/questions/29530355/plotting-multiple-histograms-in-grid
sns.set()

def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure(figsize=(20,20))
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=20,ax=ax, color='#58D68D')
        ax.set_title(var_name+" Distribution")
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

draw_histograms(plotNumeric.drop(['SEP',
                                  'SEP_NS',
                                  'SEP_SA',
                                  'SEP_SC',
                                  'SEP_SD',
                                  'SEP_SH', 
                                  'SEP_SI'], axis=1),
                plotNumeric.drop(['SEP',
                                  'SEP_NS',
                                  'SEP_SA',
                                  'SEP_SC',
                                  'SEP_SD',
                                  'SEP_SH',
                                  'SEP_SI'], axis=1).columns, 6, 3)
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 3.38 s, sys: 1.99 s, total: 5.38 s
Wall time: 3.4 s
In [27]:
%%time
# Inspired by http://seaborn.pydata.org/examples/many_pairwise_correlations.html

#plt.matshow(plotNumeric.corr())

sns.set(style='white')
corr = plotNumeric.drop(['SEP'], axis=1).corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask, k=1)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(250, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.set(font_scale=0.95)
heatCorr = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1,
                       square=True, annot=True, linewidths=1,
                       cbar_kws={"shrink": .5}, ax=ax, fmt='.1g')
#heatCorr.
ax.tick_params(labelsize=15)
cax = plt.gcf().axes[-1]
cax.tick_params(labelsize=15)

sns.plt.show()
#sns.heatmap(corr, annot=True, linewidths=0.01, cmap=cmap, ax=ax)
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 4.53 s, sys: 1.18 s, total: 5.71 s
Wall time: 4.69 s

Based on the distribution of attributes identified above, we have decided to take the log transform of several attributes.

  • Salary
  • LOS (augmented by a value of .00001 to adjust for the undefined result of log(0)
  • SEPCount_EFDATE_OCC
  • SEPCount_EFDATE_LOC
In [28]:
%%time

# Log Transform Columns Added
OPMDataMerged["SALARYLog"] = OPMDataMerged["SALARY"].apply(np.log)
OPMDataMerged["LOSLog"] = (OPMDataMerged["LOS"] + .00001).apply(np.log)
OPMDataMerged["SEPCount_EFDATE_OCCLog"] = OPMDataMerged["SEPCount_EFDATE_OCC"].apply(np.log)
OPMDataMerged["SEPCount_EFDATE_LOCLog"] = OPMDataMerged["SEPCount_EFDATE_LOC"].apply(np.log)
OPMDataMerged["IndAvgSalaryLog"] = OPMDataMerged["IndAvgSalary"].apply(np.log)
CPU times: user 1.37 s, sys: 97.2 ms, total: 1.46 s
Wall time: 1.43 s

We next review categorical data to improve our understanding of factor levels.

In [29]:
#%%time
#
#cols = list(SampledOPMData.select_dtypes(include=['object']))
#dropCols = ["LOCTYP",
#            "LOCTYPT",
#            "OCCTYP",
#            "OCCTYPT",
#            "PPTYP",
#            "PPTYPT",
#            "AGYTYP",
#            "OCCFAM",
#            "PPGROUP",
#            "PAYPLAN",
#            "TOATYP",
#            "WSTYP",
#            "AGYSUBT",
#            "AGELVL",
#            "LOSLVL",
#            "LOC",
#            "OCC",
#            "PATCO",
#            "SALLVL",
#            "TOA",
#            "WORKSCH"]
#
#for i in dropCols:
#    cols.remove(i)
#
#plotCat = SampledOPMData[cols]
#display(plotCat.head())
#print("plotCat Has {0} Records".format(len(plotCat)))
#print("Number of colums = ", len(cols))
In [30]:
%%time

cols = list(OPMDataMerged.select_dtypes(include=['object']))
dropCols = ["LOCTYP",
            "LOCTYPT",
            "OCCTYP",
            "OCCTYPT",
            "PPTYP",
            "PPTYPT",
            "AGYTYP",
            "OCCFAM",
            "PPGROUP",
            "PAYPLAN",
            "TOATYP",
            "WSTYP",
            "AGYSUBT",
            "AGELVL",
            "LOSLVL",
            "LOC",
            "OCC",
            "PATCO",
            "SALLVL",
            "TOA",
            "WORKSCH"]

for i in dropCols:
    cols.remove(i)

plotCat = OPMDataMerged[cols]
display(plotCat.head())
print("plotCat Has {0} Records".format(len(plotCat)))
print("Number of colums = ", len(cols))
AGYSUB SEP DATECODE GENDER GSEGRD PPGRD AGYTYPT AGY AGYT QTR AGELVLT LOSLVLT LOCT OCCFAMT OCCT PATCOT PPGROUPT PAYPLANT SALLVLT TOATYPT TOAT WSTYPT WORKSCHT
0 AA00 SC 201507 M 11 GS-11 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES 4 25-29 Less than 1 year 11-DISTRICT OF COLUMBIA 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional Standard GSEG Pay Plans GS-GENERAL SCHEDULE $60,000 - $69,999 Non-permanent 40-Excepted Service - Schedule A Full-time Full-time Nonseasonal
1 AA00 SC 201506 F 15 GS-15 Small Independent Agencies (less than 100 empl... AA AA-ADMINISTRATIVE CONFERENCE OF THE UNITED STATES 3 30-34 3 - 4 years 11-DISTRICT OF COLUMBIA 09xx-LEGAL AND KINDRED 0905-GENERAL ATTORNEY Professional Standard GSEG Pay Plans GS-GENERAL SCHEDULE $120,000 - $129,999 Permanent 30-Excepted Service - Schedule A Full-time Full-time Nonseasonal
2 AF** SA 201503 M 11 GS-11 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE 2 50-54 3 - 4 years 48-TEXAS 22xx-INFORMATION TECHNOLOGY 2210-INFORMATION TECHNOLOGY MANAGEMENT Administrative Standard GSEG Pay Plans GS-GENERAL SCHEDULE $60,000 - $69,999 Permanent 10-Competitive Service - Career Full-time Full-time Nonseasonal
3 AF02 SD 201506 M 15 GS-15 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE 3 55-59 35 years or more 35-NEW MEXICO 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative Standard GSEG Pay Plans GS-GENERAL SCHEDULE $150,000 - $159,999 Permanent 10-Competitive Service - Career Full-time Full-time Nonseasonal
4 AF03 SC 201509 M 13 GS-13 Cabinet Level Agencies AF AF-DEPARTMENT OF THE AIR FORCE 4 50-54 1 - 2 years 06-CALIFORNIA 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS 0301-MISCELLANEOUS ADMINISTRATION AND PROGRAM Administrative Standard GSEG Pay Plans GS-GENERAL SCHEDULE $90,000 - $99,999 Permanent 15-Competitive Service - Career-Conditional Full-time Full-time Nonseasonal
plotCat Has 8124886 Records
Number of colums =  23
CPU times: user 3.12 s, sys: 1.72 s, total: 4.83 s
Wall time: 4.83 s

AGYSUB

High seperation among following:

  • Veterans Health Administration (VATA)
  • Forest Service (AG11)

GENDER

Similar separation distributions among males and females, except more terminations due to contract expiration among males

GSEGRD

High termination due to expired appt/other among following:

  • 3
  • 4
  • 5

Bimodal Quit distribution with outlier spike at GSEGRD 9:

  • Distribution 1 from GSEGRD 3 to 8
  • Distribution 2 from GSEGRD 11 to 15

Individual transfers highest among levels 11, 12, 13

PPGRD

Majority of distribution resides in GS values per the GSEGRD observations described above.... Are other PPGRD values of any significance? What are corporate grades all about?

AGYT

Top three Agencies with separation:

  1. AR-Department of the Army
  2. AG-Department of Agriculture
  3. VA-Department of Veteran Affairs

High contract termination in:

  • AG-Department of Agriculture
  • IN-Department of the Interior

While Veteran Affairs and Army both have many quits and many retirees, the Army has significantly more individual transfers (on par with retirements)

QTR

Most contract terminations in 1st and 4th quarters

Retirement peaks in 2nd quarter

Number of quits increases from one quarter to the next

*Bear in mind these are quarters from single year only so time-sensitive trends may not be applicable*

AGELVLT

High termination due to expired appt/other among following:

  • B
  • C

Number of Quits peaks at AGELVL D

Individual transfer counts mostly trend with Quits

Retirement highest at following:

  • I
  • J
  • K

LOSLVLT

Highest Quit count for LOSLVL A (< 1 year service) which then declines for levels B and C before spiking again at level D (5-9 years service)

Same pattern is observed for contract terminations but without any significant spikes with longer service

Large individual transfer spike at LOSLVL D (5-9 years service)

Retirement starts at LOSLVL D but trends upward to J

LOCT

Contract terminations comprise most California terminations among top total separation states

East Coast locations may possibly have most individual transfers, the most being in Washington DC

OCCFAMT

03xx-General Admin, clerical, and office svcs highest separation by far but indicates both high number of Quits and Retirements

Many quits in 06xx-Medical

04xx-Natural Resources again indicates high number of contract terminations

01xx-Social Science has even number of Quits and retirements

OCCT

PATCOT

PAYPLANT

Results skewed by GS

TOAT

WORKSCHT

Should model full time only

In [31]:
def subCountPlot(att1, att2, thresh):
    counts = plotCat.groupby([att1, att2]).size().unstack(fill_value=0) # Get att1 sizes by att2
    counts = pd.concat([counts,counts.sum(axis=1)], axis=1) # Calculate total for each att1 value and append total as new column
    counts.rename(columns={0:"Total"}, inplace=True)
    top = counts[counts["Total"] > thresh].index.tolist() # Obtain att1 values where total surpasses threshold
    
    zoom = plotCat[plotCat[att1].isin(top)] # Subset data to only the top att1 values
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10), sharey=False)
    sns.countplot(y=att1, data=zoom, color="blue", ax=ax1); # Dark blue signifies zoomed data
    sns.countplot(y=att1, data=zoom, hue=att2, palette="hls", ax=ax2);
In [32]:
def percBarPlot(att1, att2, numColors):
    # Create count by att1 and att2
    counts = plotCat.groupby([att1, att2]).size().unstack(fill_value=0) # Get att1 sizes by att2
    counts = pd.concat([counts,counts.sum(axis=1)], axis=1) # Calculate total for each att1 value and append total as new column
    counts.rename(columns={0:"Total"}, inplace=True)
    #counts.drop('Total', axis=1).plot(kind='bar', stacked=True)
    
    # create cmap from sns color palette
    my_cmap = ListedColormap(sns.color_palette('hls', numColors).as_hex())

    # Create and plot percentage by att1 and att2
    nest1 = []
    for i in counts.values:
        nest2 = []
        for j in i:
            nest2.append(float(j/(i[len(i)-1:]))*100)
        nest1.append(nest2)
    perc = pd.DataFrame(nest1)
    perc = perc.set_index(counts.index.values)
    perc.columns = counts.columns
    perc.drop('Total', axis=1).plot(kind='bar', stacked=True, ylim=(0,100), figsize={13,6}, title=att1+' Percentage Plot', colormap=my_cmap)
In [33]:
temp = cols[:4] # for quick visualization debug only; may delete once complete
In [34]:
%%time

for i in cols:
    if i != 'SEP':
        plt.figure(i) # Required to create new figure each loop rather than drawing over previous object
        f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10), sharey=False)
        sns.countplot(y=i, data=plotCat, color="lightblue", ax=ax1);
        sns.countplot(y=i, data=plotCat, hue="SEP", palette="hls", ax=ax2);
        
    if i == 'AGYSUB':
        subCountPlot(i, 'SEP', 10000)
    elif i == 'LOCT':
        subCountPlot(i, 'SEP', 4000)
    elif i == 'OCCT':
        subCountPlot(i, 'SEP', 2000)
    elif i == 'PPGRD':
        subCountPlot(i, 'SEP', 6000)
    elif i == 'AGYT':
        subCountPlot(i, 'SEP', 3000)
/usr/local/es7/lib/python3.5/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
CPU times: user 10min 30s, sys: 11.7 s, total: 10min 42s
Wall time: 10min 42s
<matplotlib.figure.Figure at 0x7f770262c358>
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
<matplotlib.figure.Figure at 0x7f76d01c3438>
<matplotlib.figure.Figure at 0x7f76d05879e8>
<matplotlib.figure.Figure at 0x7f76d091df28>
<matplotlib.figure.Figure at 0x7f76cf6bf208>
<matplotlib.figure.Figure at 0x7f76c872dd68>
<matplotlib.figure.Figure at 0x7f76b7d52470>
<matplotlib.figure.Figure at 0x7f76b7c4c9e8>
<matplotlib.figure.Figure at 0x7f76c82383c8>
<matplotlib.figure.Figure at 0x7f76bbf18748>
<matplotlib.figure.Figure at 0x7f76bbda90b8>
<matplotlib.figure.Figure at 0x7f76bbbd2da0>
<matplotlib.figure.Figure at 0x7f76bb43dc18>
<matplotlib.figure.Figure at 0x7f76bad9ad30>
<matplotlib.figure.Figure at 0x7f76c4b37940>
<matplotlib.figure.Figure at 0x7f76c2b84c50>
<matplotlib.figure.Figure at 0x7f76c2857860>
<matplotlib.figure.Figure at 0x7f76c26b1630>
<matplotlib.figure.Figure at 0x7f76c14fdc88>
<matplotlib.figure.Figure at 0x7f76c10ed518>
<matplotlib.figure.Figure at 0x7f76c0fdeac8>
<matplotlib.figure.Figure at 0x7f76c0cfe6a0>
In [35]:
%%time

for i in cols:
    if i != 'SEP':
        percBarPlot(i, 'SEP', len(plotCat.SEP.drop_duplicates()))
/usr/local/es7/lib/python3.5/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
CPU times: user 47.7 s, sys: 2.46 s, total: 50.2 s
Wall time: 50 s
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [36]:
percBarPlot('GSEGRD', 'SALLVLT', len(plotCat.SALLVLT.drop_duplicates()))
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [37]:
percBarPlot('PATCOT', 'SALLVLT', len(plotCat.SALLVLT.drop_duplicates()))
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [38]:
%%time

sns.violinplot(x="PATCOT", y="SALARY", hue="GENDER", data=OPMDataMerged[OPMDataMerged.GENDER != 'Z'], split=True,
               inner="quart", palette={"M": "b", "F": "pink"})
sns.despine(left=True)
CPU times: user 20.9 s, sys: 57.4 s, total: 1min 18s
Wall time: 13.7 s
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [39]:
%%time

sns.set(style="whitegrid", palette="pastel", color_codes=True)

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="SEP", y="SALARY", hue="GENDER", data=OPMDataMerged[OPMDataMerged.GENDER != 'Z'], split=True,
               inner="quart", palette={"M": "b", "F": "pink"})
sns.despine(left=True)
CPU times: user 16.1 s, sys: 33.4 s, total: 49.5 s
Wall time: 11.8 s
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [40]:
%%time

sns.factorplot(x="SEP", y="SALARY", hue="GENDER", col="PATCOT",
               data=OPMDataMerged[OPMDataMerged.GENDER != 'Z'],
               kind="violin", split=True, aspect=.4, size=10);
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 34 s, sys: 16.1 s, total: 50.1 s
Wall time: 36.6 s
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7f76c08b1c88>
In [41]:
%%time

sns.factorplot(x="SEP", y="SALARY", col="PATCOT", data=OPMDataMerged,
               kind="violin", split=True, aspect=.4, size=10, palette = "hls");
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 4min 7s, sys: 20min 30s, total: 24min 38s
Wall time: 1min 4s
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7f77cc37a710>
In [42]:
%%time

g = sns.PairGrid(data=OPMDataMerged,
                 x_vars=["SEP","PATCOT"],
                 y_vars=["SALARY", "LOS", "LowerLimitAge", "YearsToRetirement"],
                 aspect=1, size=10)
g.map(sns.violinplot, palette="pastel");
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 23min 30s, sys: 1h 55min 1s, total: 2h 18min 31s
Wall time: 5min 40s
In [43]:
del(plotNumeric, plotCat)

Focusing in on our Target Demographic

After analyzing the above plots for our categorical data, we have decided to narrow our focus due to the large variability in the dataset. We take the below actions on our dataset:

  • Keep only Full-time Nonseasonal observations
  • Remove the location US-SUPPRESSED (SEE DATA DEFINITIONS) due to apparent bias towards unknowns in the non-separation data
  • Keep only General Schedele Grades above 7.
  • Focus model generation on White Collar Jobs only
  • Create a Training set for the Professional PATCO value, and a Testing set for Administration

In addition, we have opted to remove the below attributes for model generation:

  • Datecode, QTR; Although very relevant for merging data from alternate sources, we do not have several years of data so this does not bring us much value
  • All Agency Attributes(AGYTYP,AGYTYPT,AGY,AGYT,AGYSUB,AGYSUBT); We are not concerned with agencies
  • Gender; Missing values for Non-Separation observations
  • Count; Missing values for Non-Separation observations; Also, all values = 1 so not very useful
  • PAYPLAN,PAYPLANT,PPGRD; Much too granular than we care for
  • LOSLVL,LOSLVLT; we have a numerical version of this attribute
  • OCC,OCCT; Much too granular than we care for

Our goal is to limit our focus to Professional occupations, build a model, then test that generated model on the Administration segment of the population.

In [44]:
%%time

#Removing Attributes
cols = list(OPMDataMerged.columns)
dropCols = ["QTR",
            "AGYTYP",
            "AGYTYPT",
            "AGY",
            "AGYT",
            "AGYSUB",
            "AGYSUBT",
            "GENDER",
            "COUNT",
            "PAYPLAN",
            "PAYPLANT",
            "PPGRD",
            "LOSLVL",
            "LOSLVLT",
            "SALLVL",
            "SALLVLT",
            "OCC",
            "OCCT"]

for i in dropCols:
    if i in cols:
        cols.remove(i)

OPMDataMerged = OPMDataMerged[cols]

# Keep only Full-time Nonseasonal observations
OPMDataMerged = OPMDataMerged[OPMDataMerged["WORKSCH"] == "F"]

#Remove the location US-SUPPRESSED (SEE DATA DEFINITIONS)
OPMDataMerged = OPMDataMerged[OPMDataMerged["LOC"] != "US"]

#Keep only General Schedele Grades above 7.
OPMDataMerged["GSEGRD"] = OPMDataMerged["GSEGRD"].apply(pd.to_numeric)
OPMDataMerged = OPMDataMerged[OPMDataMerged["GSEGRD"] >= 7]

#Focus model generation on White Collar Jobs only
OPMDataMerged = OPMDataMerged[OPMDataMerged["OCCTYP"] == "1"]

#Create a Training set for the Professional PATCO value, and a Testing set for Administration
OPMDataMergedProf = OPMDataMerged[OPMDataMerged["PATCO"] == "1"]
OPMDataMergedAdmin = OPMDataMerged[OPMDataMerged["PATCO"] == "2"]
CPU times: user 1min 39s, sys: 4.51 s, total: 1min 44s
Wall time: 1min 44s
In [45]:
display(OPMDataMergedProf.head())
print(len(OPMDataMergedProf))
SEP DATECODE AGELVL GSEGRD LOC PATCO TOA WORKSCH SALARY LOS AGELVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog
0 SC 201507 C 11.0 11 1 40 F 63722.0 0.8 25-29 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 2 Non-permanent 40-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 205.0 1319 64540.593830 -818.593830 25.0 32.0 0.4 0.5 34 2.6 11 13 74 0.4 10 1.2 11.062285 -0.223131 5.323010 7.184629 11.075050
1 SC 201506 D 15.0 11 1 30 F 126245.0 4.8 30-34 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 09 09xx-LEGAL AND KINDRED Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 30-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 207.0 1132 149864.298504 -23619.298504 30.0 27.0 0.4 0.5 34 2.3 12 13 65 0.4 10 1.2 11.745980 1.568618 5.332719 7.031741 11.917485
8 SD 201509 I 14.0 06 1 10 F 135500.0 14.3 55-59 1 United States 06-CALIFORNIA 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 122.0 1853 125803.916312 9696.083688 55.0 2.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4 11.816727 2.660260 4.804021 7.524561 11.742480
11 SD 201503 J 14.0 08 1 10 F 128223.0 20.6 60-64 1 United States 08-COLORADO 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 92.0 329 126328.546349 1894.453651 60.0 -3.0 0.3 0.4 31 3.0 9 10 86 0.5 12 1.1 11.761526 3.025292 4.521789 5.796058 11.746641
14 SA 201508 H 13.0 06 1 10 F 111566.0 24.3 50-54 1 United States 06-CALIFORNIA 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 110.0 1606 105047.296509 6518.703491 50.0 7.0 0.5 0.6 41 2.3 14 17 67 0.3 10 1.5 11.622372 3.190477 4.700480 7.381502 11.562166
1282291
In [46]:
display(OPMDataMergedAdmin.head())
print(len(OPMDataMergedAdmin))
SEP DATECODE AGELVL GSEGRD LOC PATCO TOA WORKSCH SALARY LOS AGELVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog
2 SA 201503 H 11.0 48 2 10 F 66585.0 4.9 50-54 1 United States 48-TEXAS 1 White Collar 22 22xx-INFORMATION TECHNOLOGY Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 439.0 1087 71530.963755 -4945.963755 50.0 7.0 0.3 0.4 31 3.0 9 10 86 0.5 12 1.1 11.106235 1.589237 6.084499 6.991177 11.177886
3 SD 201506 I 15.0 35 2 10 F 156737.0 39.8 55-59 1 United States 35-NEW MEXICO 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 670.0 265 146735.220304 10001.779696 55.0 2.0 0.4 0.5 34 2.3 12 13 65 0.4 10 1.2 11.962325 3.683867 6.507278 5.579730 11.896385
4 SC 201509 H 13.0 06 2 15 F 92973.0 1.0 50-54 1 United States 06-CALIFORNIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 721.0 1853 101641.124025 -8668.124025 50.0 7.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4 11.440064 0.000010 6.580639 7.524561 11.529203
5 SD 201509 I 13.0 35 2 10 F 102943.0 11.3 55-59 1 United States 35-NEW MEXICO 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 10-Competitive Service - Career 1 Full-time Full-time Nonseasonal 721.0 557 101641.124025 1301.875975 55.0 2.0 0.4 0.5 38 1.9 12 14 55 0.4 12 1.4 11.541931 2.424804 6.580639 6.322565 11.529203
10 SA 201502 F 11.0 35 2 15 F 70621.0 9.7 40-44 1 United States 35-NEW MEXICO 1 White Collar 22 22xx-INFORMATION TECHNOLOGY Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 390.0 169 71530.963755 -909.963755 40.0 17.0 0.3 0.4 26 3.2 8 10 91 0.3 8 1.0 11.165083 2.272127 5.966147 5.129899 11.177886
2128093

Sampling

In [47]:
#curious on stratum SEP counts for full remaining data
stratum = pd.DataFrame({'StratCount' : OPMDataMerged.groupby(["SEP"]).size()}).reset_index()

display(stratum)
SEP StratCount
0 NS 4101470
1 SA 17983
2 SC 22021
3 SD 38956
4 SH 79
5 SI 2727
In [48]:
#Assess Stratum SEP Counts for Prof, for use in sampling
maxSize=4000
stratumProf = pd.DataFrame({'StratCount' : OPMDataMergedProf.groupby(["SEP"]).size()}).reset_index()

stratumProf.loc[stratumProf["StratCount"]>maxSize,"StratCountSample"] = maxSize
stratumProf.loc[stratumProf["StratCount"]<=maxSize,"StratCountSample"] = stratumProf["StratCount"]
#else: stratum["StratCountSample"] = stratum["StratCount"]

display(stratumProf)
SEP StratCount StratCountSample
0 NS 1259283 4000.0
1 SA 5463 4000.0
2 SC 7423 4000.0
3 SD 9476 4000.0
4 SH 15 15.0
5 SI 631 631.0
In [49]:
#Assess Stratum SEP Counts for Admin, for use in sampling
maxSize=4000
stratumAdmin = pd.DataFrame({'StratCount' : OPMDataMergedAdmin.groupby(["SEP"]).size()}).reset_index()

stratumAdmin.loc[stratumAdmin["StratCount"]>maxSize,"StratCountSample"] = maxSize
stratumAdmin.loc[stratumAdmin["StratCount"]<=maxSize,"StratCountSample"] = stratumAdmin["StratCount"]
#else: stratum["StratCountSample"] = stratum["StratCount"]

display(stratumAdmin)
SEP StratCount StratCountSample
0 NS 2087084 4000.0
1 SA 9252 4000.0
2 SC 9156 4000.0
3 SD 21366 4000.0
4 SH 39 39.0
5 SI 1196 1196.0
In [50]:
%%time
def aggStratPop(stratum, OPMDataMerged):
    AggStrat = []

    for i in range(0,len(stratum)):
        sep = stratum["SEP"].ix[i]
        StratCountSample = stratum["StratCountSample"].ix[i]
        print("Stratum Sample Size Calculations for SEP: {}".format(sep))   
        AggStrat.append(pd.DataFrame({'StratCount' : OPMDataMerged[OPMDataMerged["SEP"]==sep].groupby(["DATECODE", "AGELVL"]).size()}).reset_index())
        AggStrat[i]["SEP"] = sep
        AggStrat[i]["TotalCount"] = len(OPMDataMerged[OPMDataMerged["SEP"]==sep])
        AggStrat[i]["p"] = AggStrat[i]["StratCount"] / AggStrat[i]["TotalCount"]
        AggStrat[i]["StratCountSample"] = StratCountSample
        AggStrat[i]["StratSampleSize"] = round(AggStrat[i]["p"] * StratCountSample).apply(int)

        display(AggStrat[i].head())
        print("totalStratumSampleSize: ", AggStrat[i]["StratSampleSize"].sum())
        #print(len(AggStrat[i]))
    return AggStrat
CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 8.82 µs
In [51]:
def SampleStrata(stratum, OPMDataMerged, FileName):
    AggStrat = aggStratPop(stratum, OPMDataMerged)

    SampledOPMStratumDataList = []

    for i,StratSampleSize in enumerate(AggStrat):
        SampledOPMStratumData = []
        for j in range(0,len(StratSampleSize)):
            SEP = StratSampleSize["SEP"].ix[j]
            DATECODE = StratSampleSize["DATECODE"].ix[j]
            AGELVL = StratSampleSize["AGELVL"].ix[j]
            SampleSize = StratSampleSize["StratSampleSize"].ix[j]
            print(SEP, DATECODE, AGELVL, SampleSize)

            SampledOPMStratumDataList.append(OPMDataMerged[(OPMDataMerged["SEP"]==SEP) 
                                                    & (OPMDataMerged["DATECODE"]==DATECODE) 
                                                    & (OPMDataMerged["AGELVL"]==AGELVL)].sample(SampleSize,  random_state=SampleSize))
        SampledOPMStratumData.append(pd.concat(SampledOPMStratumDataList))
        clear_display()
    SampledOPMData = pd.concat(SampledOPMStratumData).reset_index()
    del SampledOPMData["index"]
    pickleObject(SampledOPMData, FileName)
    clear_display()

    return SampledOPMData

Using a seed value equal to each strata sample size, we take random samples according to the computed sizes above. We loop through each Separation Type's Aggregated Strata Sample Sizes; Identify all observations matching on Datecode, Separation Type, and AgeLevel; and finally sample those observations with the computed sample size.

In [52]:
%%time
##Prof Data Sampling
if os.path.isfile(PickleJarPath+"/SampledOPMDataProf.pkl"):
    print("Found the File! Loading Pickle Now!")
    SampledOPMDataProf = unpickleObject("SampledOPMDataProf")
else:
    SampledOPMDataProf= SampleStrata(stratumProf, OPMDataMergedProf, "SampledOPMDataProf")
Stratum Sample Size Calculations for SEP: NS
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201412 B 2421 NS 1259283 0.001923 4000.0 8
1 201412 C 20935 NS 1259283 0.016625 4000.0 66
2 201412 D 38775 NS 1259283 0.030791 4000.0 123
3 201412 E 37920 NS 1259283 0.030112 4000.0 120
4 201412 F 36400 NS 1259283 0.028905 4000.0 116
totalStratumSampleSize:  4003
Stratum Sample Size Calculations for SEP: SA
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 B 2 SA 5463 0.000366 4000.0 1
1 201410 C 56 SA 5463 0.010251 4000.0 41
2 201410 D 84 SA 5463 0.015376 4000.0 62
3 201410 E 70 SA 5463 0.012813 4000.0 51
4 201410 F 67 SA 5463 0.012264 4000.0 49
totalStratumSampleSize:  3996
Stratum Sample Size Calculations for SEP: SC
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 B 10 SC 7423 0.001347 4000.0 5
1 201410 C 92 SC 7423 0.012394 4000.0 50
2 201410 D 154 SC 7423 0.020746 4000.0 83
3 201410 E 118 SC 7423 0.015897 4000.0 64
4 201410 F 80 SC 7423 0.010777 4000.0 43
totalStratumSampleSize:  3999
Stratum Sample Size Calculations for SEP: SD
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 E 2 SD 9476 0.000211 4000.0 1
1 201410 F 5 SD 9476 0.000528 4000.0 2
2 201410 G 6 SD 9476 0.000633 4000.0 3
3 201410 H 14 SD 9476 0.001477 4000.0 6
4 201410 I 179 SD 9476 0.018890 4000.0 76
totalStratumSampleSize:  3994
Stratum Sample Size Calculations for SEP: SH
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201411 E 1 SH 15 0.066667 15.0 1
1 201411 F 1 SH 15 0.066667 15.0 1
2 201411 I 1 SH 15 0.066667 15.0 1
3 201412 C 1 SH 15 0.066667 15.0 1
4 201501 D 1 SH 15 0.066667 15.0 1
totalStratumSampleSize:  15
Stratum Sample Size Calculations for SEP: SI
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 C 1 SI 631 0.001585 631.0 1
1 201410 D 8 SI 631 0.012678 631.0 8
2 201410 E 5 SI 631 0.007924 631.0 5
3 201410 F 5 SI 631 0.007924 631.0 5
4 201410 G 6 SI 631 0.009509 631.0 6
totalStratumSampleSize:  631
NS 201412 B 8
NS 201412 C 66
NS 201412 D 123
NS 201412 E 120
NS 201412 F 116
NS 201412 G 128
NS 201412 H 161
NS 201412 I 143
NS 201412 J 92
NS 201412 K 46
NS 201503 B 8
NS 201503 C 65
NS 201503 D 124
NS 201503 E 122
NS 201503 F 115
NS 201503 G 127
NS 201503 H 160
NS 201503 I 142
NS 201503 J 91
NS 201503 K 45
NS 201506 B 8
NS 201506 C 63
NS 201506 D 124
NS 201506 E 124
NS 201506 F 115
NS 201506 G 126
NS 201506 H 159
NS 201506 I 142
NS 201506 J 92
NS 201506 K 45
NS 201509 B 9
NS 201509 C 64
NS 201509 D 125
NS 201509 E 127
NS 201509 F 115
NS 201509 G 126
NS 201509 H 157
NS 201509 I 142
NS 201509 J 92
NS 201509 K 46
SA 201410 B 1
SA 201410 C 41
SA 201410 D 62
SA 201410 E 51
SA 201410 F 49
SA 201410 G 49
SA 201410 H 40
SA 201410 I 29
SA 201410 J 9
SA 201410 K 4
SA 201411 B 1
SA 201411 C 51
SA 201411 D 84
SA 201411 E 70
SA 201411 F 68
SA 201411 G 48
SA 201411 H 56
SA 201411 I 40
SA 201411 J 17
SA 201411 K 2
SA 201412 B 1
SA 201412 C 18
SA 201412 D 35
SA 201412 E 29
SA 201412 F 26
SA 201412 G 26
SA 201412 H 27
SA 201412 I 23
SA 201412 J 7
SA 201412 K 1
SA 201501 B 1
SA 201501 C 24
SA 201501 D 53
SA 201501 E 62
SA 201501 F 37
SA 201501 G 43
SA 201501 H 42
SA 201501 I 23
SA 201501 J 11
SA 201501 K 1
SA 201502 B 1
SA 201502 C 32
SA 201502 D 47
SA 201502 E 37
SA 201502 F 34
SA 201502 G 28
SA 201502 H 41
SA 201502 I 26
SA 201502 J 10
SA 201502 K 1
SA 201503 B 3
SA 201503 C 26
SA 201503 D 70
SA 201503 E 53
SA 201503 F 53
SA 201503 G 64
SA 201503 H 48
SA 201503 I 31
SA 201503 J 10
SA 201503 K 4
SA 201504 C 29
SA 201504 D 56
SA 201504 E 57
SA 201504 F 48
SA 201504 G 33
SA 201504 H 43
SA 201504 I 19
SA 201504 J 10
SA 201504 K 1
SA 201505 B 2
SA 201505 C 48
SA 201505 D 81
SA 201505 E 76
SA 201505 F 70
SA 201505 G 66
SA 201505 H 61
SA 201505 I 49
SA 201505 J 13
SA 201505 K 4
SA 201506 B 5
SA 201506 C 36
SA 201506 D 63
SA 201506 E 49
SA 201506 F 39
SA 201506 G 33
SA 201506 H 52
SA 201506 I 33
SA 201506 J 15
SA 201506 K 1
SA 201507 B 1
SA 201507 C 37
SA 201507 D 59
SA 201507 E 69
SA 201507 F 48
SA 201507 G 42
SA 201507 H 45
SA 201507 I 34
SA 201507 J 9
SA 201507 K 4
SA 201508 B 1
SA 201508 C 37
SA 201508 D 67
SA 201508 E 43
SA 201508 F 42
SA 201508 G 41
SA 201508 H 48
SA 201508 I 29
SA 201508 J 12
SA 201508 K 6
SA 201509 B 1
SA 201509 C 29
SA 201509 D 75
SA 201509 E 55
SA 201509 F 42
SA 201509 G 45
SA 201509 H 48
SA 201509 I 38
SA 201509 J 13
SA 201509 K 3
SC 201410 B 5
SC 201410 C 50
SC 201410 D 83
SC 201410 E 64
SC 201410 F 43
SC 201410 G 33
SC 201410 H 34
SC 201410 I 26
SC 201410 J 13
SC 201410 K 6
SC 201411 B 3
SC 201411 C 39
SC 201411 D 48
SC 201411 E 46
SC 201411 F 38
SC 201411 G 30
SC 201411 H 33
SC 201411 I 22
SC 201411 J 9
SC 201411 K 4
SC 201412 B 2
SC 201412 C 34
SC 201412 D 56
SC 201412 E 44
SC 201412 F 36
SC 201412 G 24
SC 201412 H 25
SC 201412 I 23
SC 201412 J 6
SC 201412 K 8
SC 201501 B 2
SC 201501 C 57
SC 201501 D 71
SC 201501 E 66
SC 201501 F 51
SC 201501 G 38
SC 201501 H 39
SC 201501 I 23
SC 201501 J 12
SC 201501 K 5
SC 201502 B 3
SC 201502 C 34
SC 201502 D 53
SC 201502 E 43
SC 201502 F 38
SC 201502 G 26
SC 201502 H 27
SC 201502 I 24
SC 201502 J 9
SC 201502 K 6
SC 201503 B 5
SC 201503 C 32
SC 201503 D 54
SC 201503 E 49
SC 201503 F 28
SC 201503 G 30
SC 201503 H 20
SC 201503 I 20
SC 201503 J 11
SC 201503 K 6
SC 201504 B 6
SC 201504 C 46
SC 201504 D 56
SC 201504 E 39
SC 201504 F 31
SC 201504 G 31
SC 201504 H 30
SC 201504 I 19
SC 201504 J 7
SC 201504 K 6
SC 201505 B 10
SC 201505 C 75
SC 201505 D 87
SC 201505 E 57
SC 201505 F 57
SC 201505 G 39
SC 201505 H 36
SC 201505 I 30
SC 201505 J 15
SC 201505 K 6
SC 201506 B 9
SC 201506 C 59
SC 201506 D 75
SC 201506 E 68
SC 201506 F 45
SC 201506 G 41
SC 201506 H 37
SC 201506 I 24
SC 201506 J 11
SC 201506 K 9
SC 201507 B 13
SC 201507 C 75
SC 201507 D 76
SC 201507 E 65
SC 201507 F 56
SC 201507 G 38
SC 201507 H 31
SC 201507 I 30
SC 201507 J 16
SC 201507 K 8
SC 201508 B 9
SC 201508 C 57
SC 201508 D 88
SC 201508 E 75
SC 201508 F 49
SC 201508 G 44
SC 201508 H 35
SC 201508 I 26
SC 201508 J 8
SC 201508 K 3
SC 201509 B 6
SC 201509 C 55
SC 201509 D 86
SC 201509 E 52
SC 201509 F 47
SC 201509 G 37
SC 201509 H 36
SC 201509 I 24
SC 201509 J 16
SC 201509 K 8
SD 201410 E 1
SD 201410 F 2
SD 201410 G 3
SD 201410 H 6
SD 201410 I 76
SD 201410 J 98
SD 201410 K 92
SD 201411 D 1
SD 201411 E 0
SD 201411 F 0
SD 201411 G 3
SD 201411 H 8
SD 201411 I 57
SD 201411 J 66
SD 201411 K 58
SD 201412 E 0
SD 201412 F 0
SD 201412 G 2
SD 201412 H 17
SD 201412 I 117
SD 201412 J 190
SD 201412 K 194
SD 201501 D 0
SD 201501 F 0
SD 201501 G 5
SD 201501 H 14
SD 201501 I 239
SD 201501 J 298
SD 201501 K 267
SD 201502 D 1
SD 201502 F 1
SD 201502 G 2
SD 201502 H 6
SD 201502 I 52
SD 201502 J 87
SD 201502 K 67
SD 201503 C 0
SD 201503 E 1
SD 201503 F 2
SD 201503 G 3
SD 201503 H 12
SD 201503 I 45
SD 201503 J 95
SD 201503 K 69
SD 201504 E 1
SD 201504 F 1
SD 201504 G 5
SD 201504 H 11
SD 201504 I 63
SD 201504 J 87
SD 201504 K 89
SD 201505 D 1
SD 201505 E 0
SD 201505 F 1
SD 201505 G 5
SD 201505 H 13
SD 201505 I 114
SD 201505 J 147
SD 201505 K 127
SD 201506 E 0
SD 201506 F 1
SD 201506 G 2
SD 201506 H 7
SD 201506 I 67
SD 201506 J 114
SD 201506 K 89
SD 201507 D 1
SD 201507 G 1
SD 201507 H 9
SD 201507 I 90
SD 201507 J 120
SD 201507 K 106
SD 201508 D 1
SD 201508 F 1
SD 201508 G 2
SD 201508 H 8
SD 201508 I 57
SD 201508 J 90
SD 201508 K 66
SD 201509 D 0
SD 201509 E 0
SD 201509 F 1
SD 201509 G 6
SD 201509 H 8
SD 201509 I 56
SD 201509 J 89
SD 201509 K 80
SH 201411 E 1
SH 201411 F 1
SH 201411 I 1
SH 201412 C 1
SH 201501 D 1
SH 201501 H 1
SH 201505 E 1
SH 201505 F 1
SH 201505 G 1
SH 201505 I 2
SH 201505 J 1
SH 201509 D 1
SH 201509 F 1
SH 201509 I 1
SI 201410 C 1
SI 201410 D 8
SI 201410 E 5
SI 201410 F 5
SI 201410 G 6
SI 201410 H 2
SI 201410 I 8
SI 201410 J 8
SI 201410 K 3
SI 201411 C 1
SI 201411 D 6
SI 201411 E 8
SI 201411 F 3
SI 201411 G 7
SI 201411 H 9
SI 201411 I 7
SI 201411 J 6
SI 201411 K 2
SI 201412 C 4
SI 201412 D 2
SI 201412 E 5
SI 201412 F 5
SI 201412 G 4
SI 201412 H 5
SI 201412 I 6
SI 201412 K 1
SI 201501 C 3
SI 201501 D 4
SI 201501 E 5
SI 201501 F 6
SI 201501 G 9
SI 201501 H 10
SI 201501 I 9
SI 201501 J 1
SI 201501 K 2
SI 201502 C 3
SI 201502 D 5
SI 201502 E 5
SI 201502 F 4
SI 201502 G 7
SI 201502 H 10
SI 201502 I 6
SI 201502 J 5
SI 201502 K 1
SI 201503 C 4
SI 201503 D 5
SI 201503 E 7
SI 201503 F 8
SI 201503 G 7
SI 201503 H 12
SI 201503 I 11
SI 201503 J 5
SI 201503 K 3
SI 201504 B 1
SI 201504 C 3
SI 201504 D 9
SI 201504 E 4
SI 201504 F 8
SI 201504 G 11
SI 201504 H 7
SI 201504 I 4
SI 201504 J 8
SI 201504 K 1
SI 201505 C 2
SI 201505 D 5
SI 201505 E 4
SI 201505 F 7
SI 201505 G 15
SI 201505 H 17
SI 201505 I 7
SI 201505 J 4
SI 201505 K 1
SI 201506 C 2
SI 201506 D 6
SI 201506 E 8
SI 201506 F 7
SI 201506 G 11
SI 201506 H 14
SI 201506 I 10
SI 201507 C 1
SI 201507 D 2
SI 201507 E 9
SI 201507 F 4
SI 201507 G 7
SI 201507 H 10
SI 201507 I 13
SI 201507 J 6
SI 201507 K 3
SI 201508 C 4
SI 201508 D 6
SI 201508 E 7
SI 201508 F 6
SI 201508 G 11
SI 201508 H 10
SI 201508 I 15
SI 201508 J 5
SI 201508 K 2
SI 201509 B 2
SI 201509 C 4
SI 201509 D 5
SI 201509 E 4
SI 201509 F 9
SI 201509 G 5
SI 201509 H 11
SI 201509 I 5
SI 201509 J 4
SI 201509 K 1
CPU times: user 2min 22s, sys: 1.37 s, total: 2min 24s
Wall time: 2min 25s
In [53]:
%%time
print(len(SampledOPMDataProf))
display(SampledOPMDataProf.head())
display(pd.DataFrame({'StratCount' : SampledOPMDataProf.groupby(["SEP"]).size()}).reset_index())
16638
SEP DATECODE AGELVL GSEGRD LOC PATCO TOA WORKSCH SALARY LOS AGELVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog
0 NS 201412 B 11.0 34 1 15 F 65377.0 2.4 20-24 1 United States 34-NEW JERSEY 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 50.0 233 66358.662093 -981.662093 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.087926 0.875473 3.912023 5.451038 11.102830
1 NS 201412 B 7.0 11 1 20 F 42631.0 0.7 20-24 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 2 Non-permanent 20-Competitive Service 1 Full-time Full-time Nonseasonal 3.0 1260 42631.000000 0.000000 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.660337 -0.356661 1.098612 7.138867 10.660337
2 NS 201412 B 11.0 51 1 15 F 77658.0 2.3 20-24 1 United States 51-VIRGINIA 1 White Collar 12 12xx-COPYRIGHT, PATENT, AND TRADE-MARK Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 26.0 1133 78919.462629 -1261.462629 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.260070 0.832913 3.258097 7.032624 11.276183
3 NS 201412 B 9.0 30 1 15 F 47923.0 3.4 20-24 1 United States 30-MONTANA 1 White Collar 13 13xx-PHYSICAL SCIENCES Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 9.0 198 53700.843750 -5777.843750 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.777351 1.223778 2.197225 5.288267 10.891184
4 NS 201412 B 7.0 42 1 15 F 54911.0 5.0 20-24 1 United States 42-PENNSYLVANIA 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 23.0 558 49910.782051 5000.217949 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.913469 1.609440 3.135494 6.324359 10.817992
SEP StratCount
0 NS 4003
1 SA 3996
2 SC 3999
3 SD 3994
4 SH 15
5 SI 631
CPU times: user 71.4 ms, sys: 5.98 ms, total: 77.4 ms
Wall time: 69.1 ms
In [54]:
%%time
#### Analyze Missing Values
filtered_msnoData = msno.nullity_sort(msno.nullity_filter(SampledOPMDataProf, filter='bottom', n=15, p=0.999), sort='descending')
msno.matrix(filtered_msnoData)

del filtered_msnoData
/usr/local/es7/lib/python3.5/site-packages/matplotlib/axes/_base.py:2903: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=-0.5, right=-0.5
  'left=%s, right=%s') % (left, right))
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 389 ms, sys: 317 ms, total: 706 ms
Wall time: 341 ms
In [55]:
%%time
##Admin Data Sampling
if os.path.isfile(PickleJarPath+"/SampledOPMDataAdmin.pkl"):
    print("Found the File! Loading Pickle Now!")
    SampledOPMDataAdmin = unpickleObject("SampledOPMDataAdmin")
else:
    SampledOPMDataAdmin= SampleStrata(stratumAdmin, OPMDataMergedAdmin, "SampledOPMDataAdmin")
Stratum Sample Size Calculations for SEP: NS
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201412 B 2077 NS 2087084 0.000995 4000.0 4
1 201412 C 19966 NS 2087084 0.009566 4000.0 38
2 201412 D 46086 NS 2087084 0.022082 4000.0 88
3 201412 E 50390 NS 2087084 0.024144 4000.0 97
4 201412 F 59393 NS 2087084 0.028457 4000.0 114
totalStratumSampleSize:  4001
Stratum Sample Size Calculations for SEP: SA
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 B 4 SA 9252 0.000432 4000.0 2
1 201410 C 65 SA 9252 0.007026 4000.0 28
2 201410 D 157 SA 9252 0.016969 4000.0 68
3 201410 E 106 SA 9252 0.011457 4000.0 46
4 201410 F 119 SA 9252 0.012862 4000.0 51
totalStratumSampleSize:  4002
Stratum Sample Size Calculations for SEP: SC
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 B 19 SC 9156 0.002075 4000.0 8
1 201410 C 94 SC 9156 0.010266 4000.0 41
2 201410 D 173 SC 9156 0.018895 4000.0 76
3 201410 E 134 SC 9156 0.014635 4000.0 59
4 201410 F 81 SC 9156 0.008847 4000.0 35
totalStratumSampleSize:  4000
Stratum Sample Size Calculations for SEP: SD
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 C 3 SD 21366 0.000140 4000.0 1
1 201410 D 3 SD 21366 0.000140 4000.0 1
2 201410 E 2 SD 21366 0.000094 4000.0 0
3 201410 F 11 SD 21366 0.000515 4000.0 2
4 201410 G 19 SD 21366 0.000889 4000.0 4
totalStratumSampleSize:  4000
Stratum Sample Size Calculations for SEP: SH
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 H 1 SH 39 0.025641 39.0 1
1 201410 I 2 SH 39 0.051282 39.0 2
2 201412 D 1 SH 39 0.025641 39.0 1
3 201412 E 1 SH 39 0.025641 39.0 1
4 201412 F 1 SH 39 0.025641 39.0 1
totalStratumSampleSize:  39
Stratum Sample Size Calculations for SEP: SI
DATECODE AGELVL StratCount SEP TotalCount p StratCountSample StratSampleSize
0 201410 C 6 SI 1196 0.005017 1196.0 6
1 201410 D 6 SI 1196 0.005017 1196.0 6
2 201410 E 14 SI 1196 0.011706 1196.0 14
3 201410 F 12 SI 1196 0.010033 1196.0 12
4 201410 G 15 SI 1196 0.012542 1196.0 15
totalStratumSampleSize:  1196
NS 201412 B 4
NS 201412 C 38
NS 201412 D 88
NS 201412 E 97
NS 201412 F 114
NS 201412 G 166
NS 201412 H 209
NS 201412 I 165
NS 201412 J 92
NS 201412 K 40
NS 201503 B 4
NS 201503 C 37
NS 201503 D 87
NS 201503 E 98
NS 201503 F 113
NS 201503 G 164
NS 201503 H 207
NS 201503 I 163
NS 201503 J 90
NS 201503 K 38
NS 201506 B 4
NS 201506 C 36
NS 201506 D 86
NS 201506 E 98
NS 201506 F 111
NS 201506 G 161
NS 201506 H 205
NS 201506 I 162
NS 201506 J 90
NS 201506 K 37
NS 201509 B 4
NS 201509 C 35
NS 201509 D 86
NS 201509 E 101
NS 201509 F 111
NS 201509 G 161
NS 201509 H 206
NS 201509 I 164
NS 201509 J 91
NS 201509 K 38
SA 201410 B 2
SA 201410 C 28
SA 201410 D 68
SA 201410 E 46
SA 201410 F 51
SA 201410 G 61
SA 201410 H 53
SA 201410 I 24
SA 201410 J 6
SA 201410 K 1
SA 201411 B 3
SA 201411 C 32
SA 201411 D 71
SA 201411 E 65
SA 201411 F 75
SA 201411 G 67
SA 201411 H 59
SA 201411 I 37
SA 201411 J 10
SA 201411 K 3
SA 201412 C 17
SA 201412 D 34
SA 201412 E 38
SA 201412 F 32
SA 201412 G 32
SA 201412 H 26
SA 201412 I 17
SA 201412 J 6
SA 201412 K 1
SA 201501 B 1
SA 201501 C 24
SA 201501 D 41
SA 201501 E 51
SA 201501 F 57
SA 201501 G 47
SA 201501 H 42
SA 201501 I 22
SA 201501 J 8
SA 201501 K 2
SA 201502 B 3
SA 201502 C 30
SA 201502 D 48
SA 201502 E 38
SA 201502 F 39
SA 201502 G 53
SA 201502 H 41
SA 201502 I 26
SA 201502 J 7
SA 201502 K 3
SA 201503 B 0
SA 201503 C 29
SA 201503 D 63
SA 201503 E 57
SA 201503 F 41
SA 201503 G 56
SA 201503 H 58
SA 201503 I 29
SA 201503 J 12
SA 201503 K 4
SA 201504 B 2
SA 201504 C 22
SA 201504 D 48
SA 201504 E 44
SA 201504 F 49
SA 201504 G 45
SA 201504 H 46
SA 201504 I 23
SA 201504 J 7
SA 201504 K 2
SA 201505 B 0
SA 201505 C 35
SA 201505 D 83
SA 201505 E 72
SA 201505 F 60
SA 201505 G 69
SA 201505 H 74
SA 201505 I 44
SA 201505 J 10
SA 201505 K 3
SA 201506 C 24
SA 201506 D 49
SA 201506 E 48
SA 201506 F 41
SA 201506 G 44
SA 201506 H 47
SA 201506 I 29
SA 201506 J 6
SA 201506 K 1
SA 201507 B 0
SA 201507 C 20
SA 201507 D 69
SA 201507 E 49
SA 201507 F 49
SA 201507 G 58
SA 201507 H 61
SA 201507 I 35
SA 201507 J 17
SA 201507 K 5
SA 201508 B 1
SA 201508 C 20
SA 201508 D 56
SA 201508 E 52
SA 201508 F 45
SA 201508 G 57
SA 201508 H 55
SA 201508 I 33
SA 201508 J 9
SA 201508 K 3
SA 201509 B 1
SA 201509 C 32
SA 201509 D 68
SA 201509 E 71
SA 201509 F 53
SA 201509 G 58
SA 201509 H 54
SA 201509 I 38
SA 201509 J 7
SA 201509 K 2
SC 201410 B 8
SC 201410 C 41
SC 201410 D 76
SC 201410 E 59
SC 201410 F 35
SC 201410 G 49
SC 201410 H 48
SC 201410 I 25
SC 201410 J 14
SC 201410 K 4
SC 201411 B 4
SC 201411 C 36
SC 201411 D 51
SC 201411 E 48
SC 201411 F 43
SC 201411 G 41
SC 201411 H 38
SC 201411 I 21
SC 201411 J 6
SC 201411 K 3
SC 201412 B 1
SC 201412 C 32
SC 201412 D 49
SC 201412 E 36
SC 201412 F 30
SC 201412 G 33
SC 201412 H 35
SC 201412 I 21
SC 201412 J 10
SC 201412 K 3
SC 201501 B 3
SC 201501 C 32
SC 201501 D 66
SC 201501 E 61
SC 201501 F 52
SC 201501 G 43
SC 201501 H 42
SC 201501 I 31
SC 201501 J 11
SC 201501 K 3
SC 201502 B 6
SC 201502 C 32
SC 201502 D 57
SC 201502 E 39
SC 201502 F 39
SC 201502 G 38
SC 201502 H 40
SC 201502 I 25
SC 201502 J 9
SC 201502 K 3
SC 201503 B 3
SC 201503 C 39
SC 201503 D 61
SC 201503 E 52
SC 201503 F 28
SC 201503 G 38
SC 201503 H 46
SC 201503 I 26
SC 201503 J 9
SC 201503 K 3
SC 201504 B 6
SC 201504 C 35
SC 201504 D 68
SC 201504 E 46
SC 201504 F 49
SC 201504 G 38
SC 201504 H 42
SC 201504 I 21
SC 201504 J 8
SC 201504 K 2
SC 201505 B 10
SC 201505 C 47
SC 201505 D 77
SC 201505 E 66
SC 201505 F 48
SC 201505 G 47
SC 201505 H 49
SC 201505 I 31
SC 201505 J 15
SC 201505 K 7
SC 201506 B 12
SC 201506 C 37
SC 201506 D 63
SC 201506 E 52
SC 201506 F 49
SC 201506 G 41
SC 201506 H 35
SC 201506 I 27
SC 201506 J 7
SC 201506 K 5
SC 201507 B 11
SC 201507 C 52
SC 201507 D 76
SC 201507 E 64
SC 201507 F 59
SC 201507 G 52
SC 201507 H 44
SC 201507 I 20
SC 201507 J 11
SC 201507 K 4
SC 201508 B 13
SC 201508 C 54
SC 201508 D 73
SC 201508 E 60
SC 201508 F 49
SC 201508 G 50
SC 201508 H 40
SC 201508 I 24
SC 201508 J 8
SC 201508 K 3
SC 201509 B 6
SC 201509 C 35
SC 201509 D 69
SC 201509 E 63
SC 201509 F 54
SC 201509 G 51
SC 201509 H 39
SC 201509 I 26
SC 201509 J 8
SC 201509 K 5
SD 201410 C 1
SD 201410 D 1
SD 201410 E 0
SD 201410 F 2
SD 201410 G 4
SD 201410 H 15
SD 201410 I 88
SD 201410 J 95
SD 201410 K 54
SD 201411 C 0
SD 201411 D 0
SD 201411 E 1
SD 201411 F 2
SD 201411 G 4
SD 201411 H 14
SD 201411 I 72
SD 201411 J 65
SD 201411 K 45
SD 201412 D 0
SD 201412 E 0
SD 201412 F 1
SD 201412 G 10
SD 201412 H 40
SD 201412 I 148
SD 201412 J 206
SD 201412 K 163
SD 201501 D 0
SD 201501 E 0
SD 201501 F 1
SD 201501 G 6
SD 201501 H 30
SD 201501 I 269
SD 201501 J 292
SD 201501 K 188
SD 201502 E 1
SD 201502 F 2
SD 201502 G 2
SD 201502 H 10
SD 201502 I 60
SD 201502 J 74
SD 201502 K 44
SD 201503 D 0
SD 201503 E 1
SD 201503 F 1
SD 201503 G 4
SD 201503 H 15
SD 201503 I 66
SD 201503 J 78
SD 201503 K 57
SD 201504 D 1
SD 201504 E 2
SD 201504 F 3
SD 201504 G 4
SD 201504 H 17
SD 201504 I 81
SD 201504 J 89
SD 201504 K 60
SD 201505 D 1
SD 201505 E 1
SD 201505 F 3
SD 201505 G 5
SD 201505 H 24
SD 201505 I 124
SD 201505 J 135
SD 201505 K 97
SD 201506 D 0
SD 201506 E 0
SD 201506 F 1
SD 201506 G 4
SD 201506 H 15
SD 201506 I 99
SD 201506 J 104
SD 201506 K 72
SD 201507 D 1
SD 201507 E 1
SD 201507 F 3
SD 201507 G 5
SD 201507 H 26
SD 201507 I 109
SD 201507 J 117
SD 201507 K 72
SD 201508 D 0
SD 201508 E 1
SD 201508 F 0
SD 201508 G 5
SD 201508 H 20
SD 201508 I 71
SD 201508 J 79
SD 201508 K 50
SD 201509 C 0
SD 201509 D 0
SD 201509 E 1
SD 201509 F 1
SD 201509 G 5
SD 201509 H 19
SD 201509 I 80
SD 201509 J 105
SD 201509 K 55
SH 201410 H 1
SH 201410 I 2
SH 201412 D 1
SH 201412 E 1
SH 201412 F 1
SH 201412 H 1
SH 201505 E 1
SH 201505 G 1
SH 201505 H 3
SH 201505 I 1
SH 201505 J 1
SH 201506 E 2
SH 201507 G 1
SH 201507 H 1
SH 201509 C 1
SH 201509 F 3
SH 201509 G 8
SH 201509 H 6
SH 201509 I 1
SH 201509 J 1
SH 201509 K 1
SI 201410 C 6
SI 201410 D 6
SI 201410 E 14
SI 201410 F 12
SI 201410 G 15
SI 201410 H 21
SI 201410 I 13
SI 201410 J 6
SI 201410 K 3
SI 201411 B 2
SI 201411 C 3
SI 201411 D 6
SI 201411 E 12
SI 201411 F 15
SI 201411 G 22
SI 201411 H 14
SI 201411 I 6
SI 201411 J 3
SI 201411 K 1
SI 201412 B 1
SI 201412 C 2
SI 201412 D 4
SI 201412 E 5
SI 201412 F 8
SI 201412 G 14
SI 201412 H 19
SI 201412 I 8
SI 201412 J 3
SI 201412 K 2
SI 201501 C 2
SI 201501 D 6
SI 201501 E 11
SI 201501 F 14
SI 201501 G 17
SI 201501 H 21
SI 201501 I 12
SI 201501 J 3
SI 201501 K 2
SI 201502 C 6
SI 201502 D 10
SI 201502 E 11
SI 201502 F 15
SI 201502 G 11
SI 201502 H 16
SI 201502 I 10
SI 201502 J 3
SI 201503 B 2
SI 201503 C 3
SI 201503 D 6
SI 201503 E 17
SI 201503 F 16
SI 201503 G 20
SI 201503 H 30
SI 201503 I 17
SI 201503 J 5
SI 201503 K 2
SI 201504 C 5
SI 201504 D 5
SI 201504 E 15
SI 201504 F 18
SI 201504 G 22
SI 201504 H 23
SI 201504 I 13
SI 201504 J 6
SI 201504 K 1
SI 201505 C 3
SI 201505 D 15
SI 201505 E 12
SI 201505 F 17
SI 201505 G 22
SI 201505 H 32
SI 201505 I 5
SI 201505 J 9
SI 201505 K 2
SI 201506 C 3
SI 201506 D 9
SI 201506 E 12
SI 201506 F 14
SI 201506 G 25
SI 201506 H 20
SI 201506 I 15
SI 201506 J 9
SI 201506 K 2
SI 201507 C 3
SI 201507 D 16
SI 201507 E 15
SI 201507 F 26
SI 201507 G 24
SI 201507 H 23
SI 201507 I 16
SI 201507 J 3
SI 201507 K 2
SI 201508 B 1
SI 201508 C 5
SI 201508 D 17
SI 201508 E 13
SI 201508 F 10
SI 201508 G 21
SI 201508 H 24
SI 201508 I 10
SI 201508 J 5
SI 201508 K 3
SI 201509 B 1
SI 201509 C 3
SI 201509 D 10
SI 201509 E 7
SI 201509 F 16
SI 201509 G 19
SI 201509 H 22
SI 201509 I 11
SI 201509 K 2
CPU times: user 3min 54s, sys: 1.58 s, total: 3min 55s
Wall time: 3min 54s
In [56]:
%%time
print(len(SampledOPMDataAdmin))
display(SampledOPMDataAdmin.head())
display(pd.DataFrame({'StratCount' : SampledOPMDataAdmin.groupby(["SEP"]).size()}).reset_index())
17238
SEP DATECODE AGELVL GSEGRD LOC PATCO TOA WORKSCH SALARY LOS AGELVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog
0 NS 201412 B 7.0 42 2 30 F 39179.0 0.5 20-24 1 United States 42-PENNSYLVANIA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 30-Excepted Service - Schedule A 1 Full-time Full-time Nonseasonal 482.0 558 44301.350294 -5122.350294 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.575896 -0.693127 6.177944 6.324359 10.698770
1 NS 201412 B 7.0 12 2 15 F 39179.0 2.3 20-24 1 United States 12-FLORIDA 1 White Collar 03 03xx-GENERAL ADMIN, CLERICAL, & OFFICE SVCS Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 482.0 769 44301.350294 -5122.350294 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.575896 0.832913 6.177944 6.645091 10.698770
2 NS 201412 B 7.0 27 2 35 F 41512.0 0.5 20-24 1 United States 27-MINNESOTA 1 White Collar 09 09xx-LEGAL AND KINDRED Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 35-Excepted Service - Schedule D 1 Full-time Full-time Nonseasonal 60.0 182 42440.609454 -928.609454 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.633738 -0.693127 4.094345 5.204007 10.655861
3 NS 201412 B 9.0 48 2 15 F 54573.0 2.8 20-24 1 United States 48-TEXAS 1 White Collar 22 22xx-INFORMATION TECHNOLOGY Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 498.0 1406 61424.170625 -6851.170625 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.907295 1.029623 6.210600 7.248504 11.025559
4 NS 201412 C 12.0 11 2 15 F 78142.0 3.0 25-29 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 22 22xx-INFORMATION TECHNOLOGY Administrative 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 498.0 1260 85141.768812 -6999.768812 25.0 32.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.266283 1.098616 6.210600 7.138867 11.352073
SEP StratCount
0 NS 4001
1 SA 4002
2 SC 4000
3 SD 4000
4 SH 39
5 SI 1196
CPU times: user 81.2 ms, sys: 9.66 ms, total: 90.8 ms
Wall time: 82.3 ms
In [57]:
%%time
#### Analyze Missing Values
filtered_msnoData = msno.nullity_sort(msno.nullity_filter(SampledOPMDataAdmin, filter='bottom', n=15, p=0.999), sort='descending')
msno.matrix(filtered_msnoData)

del filtered_msnoData
/usr/local/es7/lib/python3.5/site-packages/matplotlib/axes/_base.py:2903: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=-0.5, right=-0.5
  'left=%s, right=%s') % (left, right))
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 410 ms, sys: 303 ms, total: 713 ms
Wall time: 344 ms
In [58]:
%%time
## Describe Summary for our Model Professional Subgroup for Modeling
display(SampledOPMDataProf.describe().transpose())
count mean std min 25% 50% 75% max
GSEGRD 16638.0 12.138118 1.734298 7.000000 11.000000 12.000000 13.000000 15.000000
SALARY 16638.0 94989.438094 30376.972813 39179.000000 73444.000000 90344.000000 111988.500000 326293.000000
LOS 16638.0 13.674408 11.832424 0.000000 4.500000 9.000000 22.900000 71.500000
SEPCount_EFDATE_OCC 16638.0 150.025844 162.239891 1.000000 33.000000 83.000000 240.000000 708.000000
SEPCount_EFDATE_LOC 16638.0 739.177726 494.479945 30.000000 316.000000 596.000000 1123.000000 2791.000000
IndAvgSalary 16638.0 94678.112257 29101.241575 40129.225806 70610.296346 86204.977852 105814.494876 224893.272727
SalaryOverUnderIndAvg 16638.0 311.325836 9844.071820 -114403.045059 -5549.000502 -119.923867 6466.667452 125012.771739
LowerLimitAge 16638.0 45.535221 12.806462 20.000000 35.000000 45.000000 55.000000 65.000000
YearsToRetirement 16638.0 11.464779 12.806462 -8.000000 2.000000 12.000000 22.000000 37.000000
BLS_FEDERAL_OtherSep_Rate 16638.0 0.428603 0.084367 0.300000 0.400000 0.400000 0.500000 0.600000
BLS_FEDERAL_Quits_Rate 16638.0 0.455001 0.069551 0.300000 0.400000 0.500000 0.500000 0.600000
BLS_FEDERAL_TotalSep_Level 16638.0 36.236086 8.427340 26.000000 31.000000 34.000000 38.000000 60.000000
BLS_FEDERAL_JobOpenings_Rate 16638.0 2.453342 0.390810 1.900000 2.200000 2.300000 2.800000 3.200000
BLS_FEDERAL_OtherSep_Level 16638.0 11.938094 2.378656 8.000000 10.000000 12.000000 12.000000 17.000000
BLS_FEDERAL_Quits_Level 16638.0 12.007453 2.070046 9.000000 10.000000 13.000000 13.000000 17.000000
BLS_FEDERAL_JobOpenings_Level 16638.0 69.798233 11.397915 55.000000 62.000000 67.000000 80.000000 91.000000
BLS_FEDERAL_Layoffs_Rate 16638.0 0.459298 0.213852 0.300000 0.400000 0.400000 0.500000 1.100000
BLS_FEDERAL_Layoffs_Level 16638.0 12.374144 5.885802 7.000000 10.000000 12.000000 12.000000 30.000000
BLS_FEDERAL_TotalSep_Rate 16638.0 1.312351 0.313077 1.000000 1.100000 1.200000 1.400000 2.200000
SALARYLog 16638.0 11.413138 0.309993 10.575896 11.204278 11.411380 11.626151 12.695551
LOSLog 16638.0 2.036026 1.526209 -11.512925 1.504080 2.197226 3.131137 4.269698
SEPCount_EFDATE_OCCLog 16638.0 4.328885 1.333366 0.000000 3.496508 4.418841 5.480639 6.562444
SEPCount_EFDATE_LOCLog 16638.0 6.324450 0.821291 3.401197 5.755742 6.390241 7.023759 7.934155
IndAvgSalaryLog 16638.0 11.414288 0.294107 10.599860 11.164931 11.364483 11.569443 12.323381
CPU times: user 76.2 ms, sys: 86.9 ms, total: 163 ms
Wall time: 62.5 ms
In [59]:
#%%time

#OPMDataMerged.to_csv("OPMDataMerged.csv")
In [60]:
#os.path.getsize("OPMDataMerged.csv") #Display file size in bytes

Review Visualizations post-Data removal and sampling

Chris... can you use the SampledOPMDataProf dataset, and re-run the Visuals?

In [61]:
%%time


cols = list(SampledOPMDataProf.select_dtypes(include=['float64', 'int64']))
cols.remove('BLS_FEDERAL_OtherSep_Rate')
cols.remove('BLS_FEDERAL_Quits_Rate')
cols.remove('BLS_FEDERAL_TotalSep_Level')
cols.remove('BLS_FEDERAL_JobOpenings_Rate')
cols.remove('BLS_FEDERAL_OtherSep_Level')
cols.remove('BLS_FEDERAL_Quits_Level')
cols.remove('BLS_FEDERAL_JobOpenings_Level')
cols.remove('BLS_FEDERAL_Layoffs_Rate')
cols.remove('BLS_FEDERAL_Layoffs_Level')
cols.remove('BLS_FEDERAL_TotalSep_Rate')
cols.append('SEP')
display(cols)

plotNumeric = SampledOPMDataProf[cols]

# Create binary separation attribute for EDA correlation review
#plotNumeric["SEP_bin"] = plotNumeric.SEP.replace("NS", 1)
#plotNumeric.loc[plotNumeric['SEP_bin'] != 1, 'SEP_bin'] = 0
#plotNumeric.SEP_bin = plotNumeric.SEP_bin.apply(pd.to_numeric)
AttSplit = pd.get_dummies(plotNumeric['SEP'],prefix='SEP')
display(AttSplit.head())
plotNumeric = pd.concat((plotNumeric,AttSplit),axis=1) # add back into the dataframe

display(plotNumeric.head())
print("plotNumeric has {0} Records".format(len(plotNumeric)))
#print(plotNumeric.SEP_bin.dtype)
['GSEGRD',
 'SALARY',
 'LOS',
 'SEPCount_EFDATE_OCC',
 'SEPCount_EFDATE_LOC',
 'IndAvgSalary',
 'SalaryOverUnderIndAvg',
 'LowerLimitAge',
 'YearsToRetirement',
 'SALARYLog',
 'LOSLog',
 'SEPCount_EFDATE_OCCLog',
 'SEPCount_EFDATE_LOCLog',
 'IndAvgSalaryLog',
 'SEP']
SEP_NS SEP_SA SEP_SC SEP_SD SEP_SH SEP_SI
0 1 0 0 0 0 0
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
4 1 0 0 0 0 0
GSEGRD SALARY LOS SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog SEP SEP_NS SEP_SA SEP_SC SEP_SD SEP_SH SEP_SI
0 11.0 65377.0 2.4 50.0 233 66358.662093 -981.662093 20.0 37.0 11.087926 0.875473 3.912023 5.451038 11.102830 NS 1 0 0 0 0 0
1 7.0 42631.0 0.7 3.0 1260 42631.000000 0.000000 20.0 37.0 10.660337 -0.356661 1.098612 7.138867 10.660337 NS 1 0 0 0 0 0
2 11.0 77658.0 2.3 26.0 1133 78919.462629 -1261.462629 20.0 37.0 11.260070 0.832913 3.258097 7.032624 11.276183 NS 1 0 0 0 0 0
3 9.0 47923.0 3.4 9.0 198 53700.843750 -5777.843750 20.0 37.0 10.777351 1.223778 2.197225 5.288267 10.891184 NS 1 0 0 0 0 0
4 7.0 54911.0 5.0 23.0 558 49910.782051 5000.217949 20.0 37.0 10.913469 1.609440 3.135494 6.324359 10.817992 NS 1 0 0 0 0 0
plotNumeric has 16638 Records
CPU times: user 25.1 ms, sys: 10.2 ms, total: 35.3 ms
Wall time: 30.8 ms
In [62]:
%%time

sns.set(font_scale=1)
sns.pairplot(plotNumeric.drop(['SEP_NS',
                               'SEP_SA',
                               'SEP_SC',
                               'SEP_SD',
                               'SEP_SH', 
                               'SEP_SI'], axis=1), hue = 'SEP', palette="hls", plot_kws={"s": 50})
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 52.1 s, sys: 50 s, total: 1min 42s
Wall time: 42.2 s
In [63]:
%%time

# Function modified from https://stackoverflow.com/questions/29530355/plotting-multiple-histograms-in-grid
sns.set()

def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure(figsize=(20,20))
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=20,ax=ax, color='#58D68D')
        ax.set_title(var_name+" Distribution")
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

draw_histograms(plotNumeric.drop(['SEP',
                                  'SEP_NS',
                                  'SEP_SA',
                                  'SEP_SC',
                                  'SEP_SD',
                                  'SEP_SH', 
                                  'SEP_SI'], axis=1),
                plotNumeric.drop(['SEP',
                                  'SEP_NS',
                                  'SEP_SA',
                                  'SEP_SC',
                                  'SEP_SD',
                                  'SEP_SH',
                                  'SEP_SI'], axis=1).columns, 6, 3)
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 4.45 s, sys: 2.79 s, total: 7.24 s
Wall time: 3.95 s
In [64]:
%%time
# Inspired by http://seaborn.pydata.org/examples/many_pairwise_correlations.html

#plt.matshow(plotNumeric.corr())

sns.set(style='white')
corr = plotNumeric.drop(['SEP'], axis=1).corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask, k=1)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(250, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.set(font_scale=0.95)
heatCorr = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1,
                       square=True, annot=True, linewidths=1,
                       cbar_kws={"shrink": .5}, ax=ax, fmt='.1g')
#heatCorr.
ax.tick_params(labelsize=15)
cax = plt.gcf().axes[-1]
cax.tick_params(labelsize=15)

sns.plt.show()
#sns.heatmap(corr, annot=True, linewidths=0.01, cmap=cmap, ax=ax)
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 2.31 s, sys: 952 ms, total: 3.26 s
Wall time: 2.22 s
In [65]:
%%time

cols = list(SampledOPMDataProf.select_dtypes(include=['object']))
dropCols = ["LOCTYP",
            "LOCTYPT",
            "OCCTYP",
            "OCCTYPT",
            "PPTYP",
            "PPTYPT",
            "AGYTYP",
            "OCCFAM",
            "PPGROUP",
            "PAYPLAN",
            "TOATYP",
            "WSTYP",
            "AGYSUBT",
            "AGELVL",
            "LOSLVL",
            "LOC",
            "OCC",
            "PATCO",
            "SALLVL",
            "TOA",
            "WORKSCH"]

for i in dropCols:
    if(i in list(SampledOPMDataProf.columns)): cols.remove(i)

plotCat = SampledOPMDataProf[cols]
display(plotCat.head())
print("plotCat Has {0} Records".format(len(plotCat)))
print("Number of colums = ", len(cols))
SEP DATECODE AGELVLT LOCT OCCFAMT PATCOT PPGROUPT TOATYPT TOAT WSTYPT WORKSCHT
0 NS 201412 20-24 34-NEW JERSEY 08xx-ENGINEERING AND ARCHITECTURE Professional Standard GSEG Pay Plans Permanent 15-Competitive Service - Career-Conditional Full-time Full-time Nonseasonal
1 NS 201412 20-24 11-DISTRICT OF COLUMBIA 08xx-ENGINEERING AND ARCHITECTURE Professional Standard GSEG Pay Plans Non-permanent 20-Competitive Service Full-time Full-time Nonseasonal
2 NS 201412 20-24 51-VIRGINIA 12xx-COPYRIGHT, PATENT, AND TRADE-MARK Professional Standard GSEG Pay Plans Permanent 15-Competitive Service - Career-Conditional Full-time Full-time Nonseasonal
3 NS 201412 20-24 30-MONTANA 13xx-PHYSICAL SCIENCES Professional Standard GSEG Pay Plans Permanent 15-Competitive Service - Career-Conditional Full-time Full-time Nonseasonal
4 NS 201412 20-24 42-PENNSYLVANIA 08xx-ENGINEERING AND ARCHITECTURE Professional Standard GSEG Pay Plans Permanent 15-Competitive Service - Career-Conditional Full-time Full-time Nonseasonal
plotCat Has 16638 Records
Number of colums =  11
CPU times: user 25.4 ms, sys: 4.28 ms, total: 29.6 ms
Wall time: 26.7 ms
In [66]:
%%time

for i in cols:
    if i != 'SEP':
        plt.figure(i) # Required to create new figure each loop rather than drawing over previous object
        f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10), sharey=False)
        sns.countplot(y=i, data=plotCat, color="lightblue", ax=ax1);
        sns.countplot(y=i, data=plotCat, hue="SEP", palette="hls", ax=ax2);
        
    if i == 'AGYSUB':
        subCountPlot(i, 'SEP', 10000)
    elif i == 'LOCT':
        subCountPlot(i, 'SEP', 1000)
    elif i == 'OCCT':
        subCountPlot(i, 'SEP', 2000)
    elif i == 'PPGRD':
        subCountPlot(i, 'SEP', 6000)
    elif i == 'AGYT':
        subCountPlot(i, 'SEP', 3000)
CPU times: user 2.49 s, sys: 29.3 ms, total: 2.51 s
Wall time: 2.48 s
/usr/local/es7/lib/python3.5/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
<matplotlib.figure.Figure at 0x7f76c1652a20>
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
<matplotlib.figure.Figure at 0x7f76c24b4390>
<matplotlib.figure.Figure at 0x7f76c0f8c940>
<matplotlib.figure.Figure at 0x7f77b2773fd0>
<matplotlib.figure.Figure at 0x7f76c30af240>
<matplotlib.figure.Figure at 0x7f76c250ae10>
<matplotlib.figure.Figure at 0x7f7702742d68>
<matplotlib.figure.Figure at 0x7f76bb21d198>
<matplotlib.figure.Figure at 0x7f76b7b84080>
<matplotlib.figure.Figure at 0x7f76c0d873c8>
In [67]:
%%time

for i in cols:
    if i != 'SEP':
        percBarPlot(i, 'SEP', len(plotCat.SEP.drop_duplicates()))
CPU times: user 1.56 s, sys: 17.7 ms, total: 1.58 s
Wall time: 1.56 s
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [68]:
%%time

sns.set(style="whitegrid", palette="pastel", color_codes=True)

sns.violinplot(x="PATCOT", y="SALARY", data=SampledOPMDataProf, split=True,
               inner="quart")
sns.despine(left=True)
CPU times: user 1.05 s, sys: 5.97 s, total: 7.02 s
Wall time: 165 ms
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [69]:
%%time

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="SEP", y="SALARY", data=SampledOPMDataProf, split=True,
               inner="quart")
sns.despine(left=True)
CPU times: user 207 ms, sys: 88.9 ms, total: 296 ms
Wall time: 192 ms
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [70]:
#%%time
#
#sns.factorplot(x="SEP", y="SALARY", col="PATCOT",
#               data=SampledOPMDataProf,
#               kind="violin", split=True, aspect=.5, size=15);
In [71]:
#%%time
#
#sns.factorplot(x="SEP", y="SALARY", col="PATCOT", data=SampledOPMDataProf,
#               kind="violin", split=True, aspect=.4, size=10);
In [72]:
%%time

g = sns.PairGrid(data=SampledOPMDataProf,
                 x_vars=["SEP","PATCOT"],
                 y_vars=["SALARY", "LOS", "LowerLimitAge", "YearsToRetirement"],
                 aspect=1, size=10)
g.map(sns.violinplot, palette="pastel", inner="quart");
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
CPU times: user 5.22 s, sys: 25.1 s, total: 30.3 s
Wall time: 1.47 s
In [ ]:
 

Encode Categorical Attributes, and Remove Description Columns for Analysis Prep

Now that we have the dataset sampled, we still have some legwork necessary to convert our categorical attributes into binary integer values. Below we walk through this process for the following Attributes:

  • AGELVL
  • LOC
  • SALLVL
  • TOA
  • OCCTYP
  • OCCFAM
  • PPTYP
  • PPGROUP
  • TOATYP

Once these attributes have been encoded and description columns removed, we end up with a total of 2446 attributes in our dataset for analysis in our model generation.

In [73]:
# Clean up old objects no longer needed, to clear up memory
process = psutil.Process(os.getpid())
print("Memory Usage before Cleanup: ", process.memory_info().rss)

if 'AGELVL' in dir():
    del AGELVL
if 'AggIndAvgSalary' in dir():
    del AggIndAvgSalary
if 'AggIndAvgSalary2' in dir():
    del AggIndAvgSalary2
if 'AggSEPCount_EFDATE_LOC' in dir():
    del AggSEPCount_EFDATE_LOC
if 'AggSEPCount_EFDATE_OCC' in dir():
    del AggSEPCount_EFDATE_OCC
if 'AggStrat' in dir():
    del AggStrat
if 'DATECODE' in dir():
    del DATECODE
if 'EMPColList' in dir():
    del EMPColList
if 'EMPDataOrig4Q' in dir():
    del EMPDataOrig4Q
if 'maxSize' in dir():
    del maxSize
if 'OPMColList' in dir():
    del OPMColList
if 'OPMDataFiles' in dir():
    del OPMDataFiles
if 'OPMDataList' in dir():
    del OPMDataList
if 'OPMDataMerged' in dir():
    del OPMDataMerged
if 'OPMDataOrig' in dir():
    del OPMDataOrig
if 'SEP' in dir():
    del SEP
if 'SampleSize' in dir():
    del SampleSize
if 'SampledOPMStratumData' in dir():
    del SampledOPMStratumData
if 'SampledOPMStratumDataList' in dir():
    del SampledOPMStratumDataList
if 'StratCountSample' in dir():
    del StratCountSample
if 'StratSampleSize' in dir():
    del StratSampleSize
if 'JTL' in dir():
    del JTL
    
process = psutil.Process(os.getpid())
print("Memory Usage after Cleanup: ", process.memory_info().rss)
Memory Usage before Cleanup:  23058849792
Memory Usage after Cleanup:  21385568256
In [74]:
display(SampledOPMDataProf.head())
SampledOPMDataProf.info()
SEP DATECODE AGELVL GSEGRD LOC PATCO TOA WORKSCH SALARY LOS AGELVLT LOCTYP LOCTYPT LOCT OCCTYP OCCTYPT OCCFAM OCCFAMT PATCOT PPTYP PPTYPT PPGROUP PPGROUPT TOATYP TOATYPT TOAT WSTYP WSTYPT WORKSCHT SEPCount_EFDATE_OCC SEPCount_EFDATE_LOC IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog
0 NS 201412 B 11.0 34 1 15 F 65377.0 2.4 20-24 1 United States 34-NEW JERSEY 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 50.0 233 66358.662093 -981.662093 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.087926 0.875473 3.912023 5.451038 11.102830
1 NS 201412 B 7.0 11 1 20 F 42631.0 0.7 20-24 1 United States 11-DISTRICT OF COLUMBIA 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 2 Non-permanent 20-Competitive Service 1 Full-time Full-time Nonseasonal 3.0 1260 42631.000000 0.000000 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.660337 -0.356661 1.098612 7.138867 10.660337
2 NS 201412 B 11.0 51 1 15 F 77658.0 2.3 20-24 1 United States 51-VIRGINIA 1 White Collar 12 12xx-COPYRIGHT, PATENT, AND TRADE-MARK Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 26.0 1133 78919.462629 -1261.462629 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.260070 0.832913 3.258097 7.032624 11.276183
3 NS 201412 B 9.0 30 1 15 F 47923.0 3.4 20-24 1 United States 30-MONTANA 1 White Collar 13 13xx-PHYSICAL SCIENCES Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 9.0 198 53700.843750 -5777.843750 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.777351 1.223778 2.197225 5.288267 10.891184
4 NS 201412 B 7.0 42 1 15 F 54911.0 5.0 20-24 1 United States 42-PENNSYLVANIA 1 White Collar 08 08xx-ENGINEERING AND ARCHITECTURE Professional 1 General Schedule and Equivalently Graded (GSEG... 11 Standard GSEG Pay Plans 1 Permanent 15-Competitive Service - Career-Conditional 1 Full-time Full-time Nonseasonal 23.0 558 49910.782051 5000.217949 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.913469 1.609440 3.135494 6.324359 10.817992
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16638 entries, 0 to 16637
Data columns (total 50 columns):
SEP                              16638 non-null object
DATECODE                         16638 non-null object
AGELVL                           16638 non-null object
GSEGRD                           16638 non-null float64
LOC                              16638 non-null object
PATCO                            16638 non-null object
TOA                              16638 non-null object
WORKSCH                          16638 non-null object
SALARY                           16638 non-null float64
LOS                              16638 non-null float64
AGELVLT                          16638 non-null object
LOCTYP                           16638 non-null object
LOCTYPT                          16638 non-null object
LOCT                             16638 non-null object
OCCTYP                           16638 non-null object
OCCTYPT                          16638 non-null object
OCCFAM                           16638 non-null object
OCCFAMT                          16638 non-null object
PATCOT                           16638 non-null object
PPTYP                            16638 non-null object
PPTYPT                           16638 non-null object
PPGROUP                          16638 non-null object
PPGROUPT                         16638 non-null object
TOATYP                           16638 non-null object
TOATYPT                          16638 non-null object
TOAT                             16638 non-null object
WSTYP                            16638 non-null object
WSTYPT                           16638 non-null object
WORKSCHT                         16638 non-null object
SEPCount_EFDATE_OCC              16638 non-null float64
SEPCount_EFDATE_LOC              16638 non-null int64
IndAvgSalary                     16638 non-null float64
SalaryOverUnderIndAvg            16638 non-null float64
LowerLimitAge                    16638 non-null float64
YearsToRetirement                16638 non-null float64
BLS_FEDERAL_OtherSep_Rate        16638 non-null float64
BLS_FEDERAL_Quits_Rate           16638 non-null float64
BLS_FEDERAL_TotalSep_Level       16638 non-null int64
BLS_FEDERAL_JobOpenings_Rate     16638 non-null float64
BLS_FEDERAL_OtherSep_Level       16638 non-null int64
BLS_FEDERAL_Quits_Level          16638 non-null int64
BLS_FEDERAL_JobOpenings_Level    16638 non-null int64
BLS_FEDERAL_Layoffs_Rate         16638 non-null float64
BLS_FEDERAL_Layoffs_Level        16638 non-null int64
BLS_FEDERAL_TotalSep_Rate        16638 non-null float64
SALARYLog                        16638 non-null float64
LOSLog                           16638 non-null float64
SEPCount_EFDATE_OCCLog           16638 non-null float64
SEPCount_EFDATE_LOCLog           16638 non-null float64
IndAvgSalaryLog                  16638 non-null float64
dtypes: float64(18), int64(6), object(26)
memory usage: 6.3+ MB
In [75]:
%%time

if os.path.isfile(PickleJarPath+"/OPMAnalysisDataNoFam.pkl"):
    print("Found the File! Loading Pickle Now!")
    OPMAnalysisDataNoFam = unpickleObject("OPMAnalysisDataNoFam")
else:

    OPMAnalysisDataNoFam = SampledOPMDataProf.copy()

    cols = ["GENDER",
            "DATECODE",
            "QTR",
            "COUNT",
            "AGYTYPT",
            "AGYT",
            "AGYSUB",
            "AGYSUBT",
            "QTR",
            "AGELVLT",
            "LOSLVL",
            "LOSLVLT",
            "LOCTYPT",
            "LOCT",
            "OCCTYP",
            "OCCTYPT",
            "OCCFAM",
            "OCCFAMT",
            "OCC",
            "OCCT",
            "PATCO",
            "PPGRD",
            "PATCOT",
            "PPTYPT",
            "PPGROUPT",
            "PAYPLAN",
            "PAYPLANT",
            "SALLVLT",
            "TOATYPT",
            "TOAT",
            "WSTYP",
            "WSTYPT",
            "WORKSCH",
            "WORKSCHT",
            "SALARY",
            "LOS",
            "SEPCount_EFDATE_OCC",
            "SEPCount_EFDATE_LOC"
           ]



    #delete cols from analysis data
    for col in cols:
        if col in list(OPMAnalysisDataNoFam.columns):
            del OPMAnalysisDataNoFam[col]

    OPMAnalysisDataNoFam.info()

    cols = ["AGELVL",
            "LOC",
            "SALLVL",
            "TOA",
            "AGYTYP",
            "AGY",
            "LOCTYP",
            "PPTYP",
            "PPGROUP",
            "TOATYP"
           ]

    #Split Values for cols 
    for col in cols:
        if col in list(OPMAnalysisDataNoFam.columns):
            AttSplit = pd.get_dummies(OPMAnalysisDataNoFam[col],prefix=col)
            display(AttSplit.head())
            OPMAnalysisDataNoFam = pd.concat((OPMAnalysisDataNoFam,AttSplit),axis=1) # add back into the dataframe
            del OPMAnalysisDataNoFam[col]

    pickleObject(OPMAnalysisDataNoFam, "OPMAnalysisData")
        
display(OPMAnalysisDataNoFam.head())
print("Number of Columns: ",len(OPMAnalysisDataNoFam.columns))
OPMAnalysisDataNoFam.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16638 entries, 0 to 16637
Data columns (total 28 columns):
SEP                              16638 non-null object
AGELVL                           16638 non-null object
GSEGRD                           16638 non-null float64
LOC                              16638 non-null object
TOA                              16638 non-null object
LOCTYP                           16638 non-null object
PPTYP                            16638 non-null object
PPGROUP                          16638 non-null object
TOATYP                           16638 non-null object
IndAvgSalary                     16638 non-null float64
SalaryOverUnderIndAvg            16638 non-null float64
LowerLimitAge                    16638 non-null float64
YearsToRetirement                16638 non-null float64
BLS_FEDERAL_OtherSep_Rate        16638 non-null float64
BLS_FEDERAL_Quits_Rate           16638 non-null float64
BLS_FEDERAL_TotalSep_Level       16638 non-null int64
BLS_FEDERAL_JobOpenings_Rate     16638 non-null float64
BLS_FEDERAL_OtherSep_Level       16638 non-null int64
BLS_FEDERAL_Quits_Level          16638 non-null int64
BLS_FEDERAL_JobOpenings_Level    16638 non-null int64
BLS_FEDERAL_Layoffs_Rate         16638 non-null float64
BLS_FEDERAL_Layoffs_Level        16638 non-null int64
BLS_FEDERAL_TotalSep_Rate        16638 non-null float64
SALARYLog                        16638 non-null float64
LOSLog                           16638 non-null float64
SEPCount_EFDATE_OCCLog           16638 non-null float64
SEPCount_EFDATE_LOCLog           16638 non-null float64
IndAvgSalaryLog                  16638 non-null float64
dtypes: float64(15), int64(5), object(8)
memory usage: 3.6+ MB
AGELVL_B AGELVL_C AGELVL_D AGELVL_E AGELVL_F AGELVL_G AGELVL_H AGELVL_I AGELVL_J AGELVL_K
0 1 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0
LOC_01 LOC_02 LOC_04 LOC_05 LOC_06 LOC_08 LOC_09 LOC_10 LOC_11 LOC_12 LOC_13 LOC_15 LOC_16 LOC_17 LOC_18 LOC_19 LOC_20 LOC_21 LOC_22 LOC_23 LOC_24 LOC_25 LOC_26 LOC_27 LOC_28 LOC_29 LOC_30 LOC_31 LOC_32 LOC_33 LOC_34 LOC_35 LOC_36 LOC_37 LOC_38 LOC_39 LOC_40 LOC_41 LOC_42 LOC_44 LOC_45 LOC_46 LOC_47 LOC_48 LOC_49 LOC_50 LOC_51 LOC_53 LOC_54 LOC_55 LOC_56
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
TOA_10 TOA_15 TOA_20 TOA_30 TOA_32 TOA_35 TOA_38 TOA_40 TOA_42 TOA_44 TOA_45 TOA_48
0 0 1 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0 0 0
LOCTYP_1
0 1
1 1
2 1
3 1
4 1
PPTYP_1
0 1
1 1
2 1
3 1
4 1
PPGROUP_11 PPGROUP_12
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
TOATYP_1 TOATYP_2
0 1 0
1 0 1
2 1 0
3 1 0
4 1 0
SEP GSEGRD IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog AGELVL_B AGELVL_C AGELVL_D AGELVL_E AGELVL_F AGELVL_G AGELVL_H AGELVL_I AGELVL_J AGELVL_K LOC_01 LOC_02 LOC_04 LOC_05 LOC_06 LOC_08 LOC_09 LOC_10 LOC_11 LOC_12 LOC_13 LOC_15 LOC_16 LOC_17 LOC_18 LOC_19 LOC_20 LOC_21 LOC_22 LOC_23 LOC_24 LOC_25 LOC_26 LOC_27 LOC_28 LOC_29 LOC_30 LOC_31 LOC_32 LOC_33 LOC_34 LOC_35 LOC_36 LOC_37 LOC_38 LOC_39 LOC_40 LOC_41 LOC_42 LOC_44 LOC_45 LOC_46 LOC_47 LOC_48 LOC_49 LOC_50 LOC_51 LOC_53 LOC_54 LOC_55 LOC_56 TOA_10 TOA_15 TOA_20 TOA_30 TOA_32 TOA_35 TOA_38 TOA_40 TOA_42 TOA_44 TOA_45 TOA_48 LOCTYP_1 PPTYP_1 PPGROUP_11 PPGROUP_12 TOATYP_1 TOATYP_2
0 NS 11.0 66358.662093 -981.662093 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.087926 0.875473 3.912023 5.451038 11.102830 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0
1 NS 7.0 42631.000000 0.000000 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.660337 -0.356661 1.098612 7.138867 10.660337 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1
2 NS 11.0 78919.462629 -1261.462629 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 11.260070 0.832913 3.258097 7.032624 11.276183 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0
3 NS 9.0 53700.843750 -5777.843750 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.777351 1.223778 2.197225 5.288267 10.891184 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0
4 NS 7.0 49910.782051 5000.217949 20.0 37.0 0.5 0.4 30 2.2 12 10 62 0.3 7 1.1 10.913469 1.609440 3.135494 6.324359 10.817992 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0
Number of Columns:  100
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16638 entries, 0 to 16637
Data columns (total 100 columns):
SEP                              16638 non-null object
GSEGRD                           16638 non-null float64
IndAvgSalary                     16638 non-null float64
SalaryOverUnderIndAvg            16638 non-null float64
LowerLimitAge                    16638 non-null float64
YearsToRetirement                16638 non-null float64
BLS_FEDERAL_OtherSep_Rate        16638 non-null float64
BLS_FEDERAL_Quits_Rate           16638 non-null float64
BLS_FEDERAL_TotalSep_Level       16638 non-null int64
BLS_FEDERAL_JobOpenings_Rate     16638 non-null float64
BLS_FEDERAL_OtherSep_Level       16638 non-null int64
BLS_FEDERAL_Quits_Level          16638 non-null int64
BLS_FEDERAL_JobOpenings_Level    16638 non-null int64
BLS_FEDERAL_Layoffs_Rate         16638 non-null float64
BLS_FEDERAL_Layoffs_Level        16638 non-null int64
BLS_FEDERAL_TotalSep_Rate        16638 non-null float64
SALARYLog                        16638 non-null float64
LOSLog                           16638 non-null float64
SEPCount_EFDATE_OCCLog           16638 non-null float64
SEPCount_EFDATE_LOCLog           16638 non-null float64
IndAvgSalaryLog                  16638 non-null float64
AGELVL_B                         16638 non-null uint8
AGELVL_C                         16638 non-null uint8
AGELVL_D                         16638 non-null uint8
AGELVL_E                         16638 non-null uint8
AGELVL_F                         16638 non-null uint8
AGELVL_G                         16638 non-null uint8
AGELVL_H                         16638 non-null uint8
AGELVL_I                         16638 non-null uint8
AGELVL_J                         16638 non-null uint8
AGELVL_K                         16638 non-null uint8
LOC_01                           16638 non-null uint8
LOC_02                           16638 non-null uint8
LOC_04                           16638 non-null uint8
LOC_05                           16638 non-null uint8
LOC_06                           16638 non-null uint8
LOC_08                           16638 non-null uint8
LOC_09                           16638 non-null uint8
LOC_10                           16638 non-null uint8
LOC_11                           16638 non-null uint8
LOC_12                           16638 non-null uint8
LOC_13                           16638 non-null uint8
LOC_15                           16638 non-null uint8
LOC_16                           16638 non-null uint8
LOC_17                           16638 non-null uint8
LOC_18                           16638 non-null uint8
LOC_19                           16638 non-null uint8
LOC_20                           16638 non-null uint8
LOC_21                           16638 non-null uint8
LOC_22                           16638 non-null uint8
LOC_23                           16638 non-null uint8
LOC_24                           16638 non-null uint8
LOC_25                           16638 non-null uint8
LOC_26                           16638 non-null uint8
LOC_27                           16638 non-null uint8
LOC_28                           16638 non-null uint8
LOC_29                           16638 non-null uint8
LOC_30                           16638 non-null uint8
LOC_31                           16638 non-null uint8
LOC_32                           16638 non-null uint8
LOC_33                           16638 non-null uint8
LOC_34                           16638 non-null uint8
LOC_35                           16638 non-null uint8
LOC_36                           16638 non-null uint8
LOC_37                           16638 non-null uint8
LOC_38                           16638 non-null uint8
LOC_39                           16638 non-null uint8
LOC_40                           16638 non-null uint8
LOC_41                           16638 non-null uint8
LOC_42                           16638 non-null uint8
LOC_44                           16638 non-null uint8
LOC_45                           16638 non-null uint8
LOC_46                           16638 non-null uint8
LOC_47                           16638 non-null uint8
LOC_48                           16638 non-null uint8
LOC_49                           16638 non-null uint8
LOC_50                           16638 non-null uint8
LOC_51                           16638 non-null uint8
LOC_53                           16638 non-null uint8
LOC_54                           16638 non-null uint8
LOC_55                           16638 non-null uint8
LOC_56                           16638 non-null uint8
TOA_10                           16638 non-null uint8
TOA_15                           16638 non-null uint8
TOA_20                           16638 non-null uint8
TOA_30                           16638 non-null uint8
TOA_32                           16638 non-null uint8
TOA_35                           16638 non-null uint8
TOA_38                           16638 non-null uint8
TOA_40                           16638 non-null uint8
TOA_42                           16638 non-null uint8
TOA_44                           16638 non-null uint8
TOA_45                           16638 non-null uint8
TOA_48                           16638 non-null uint8
LOCTYP_1                         16638 non-null uint8
PPTYP_1                          16638 non-null uint8
PPGROUP_11                       16638 non-null uint8
PPGROUP_12                       16638 non-null uint8
TOATYP_1                         16638 non-null uint8
TOATYP_2                         16638 non-null uint8
dtypes: float64(15), int64(5), object(1), uint8(79)
memory usage: 3.9+ MB
CPU times: user 290 ms, sys: 13.6 ms, total: 303 ms
Wall time: 334 ms

Below is a display of all remaining attributes and their corresponding data types for analysis

In [76]:
%%time

data_type = []
for idx, col in enumerate(OPMAnalysisDataNoFam.columns):
    data_type.append(OPMAnalysisDataNoFam.dtypes[idx])

summary_df = {'Attribute Name' : pd.Series(OPMAnalysisDataNoFam.columns, index = range(len(OPMAnalysisDataNoFam.columns))), 'Data Type' : pd.Series(data_type, index = range(len(OPMAnalysisDataNoFam.columns)))}
summary_df = pd.DataFrame(summary_df)
display(summary_df)

del data_type, summary_df
Attribute Name Data Type
0 SEP object
1 GSEGRD float64
2 IndAvgSalary float64
3 SalaryOverUnderIndAvg float64
4 LowerLimitAge float64
5 YearsToRetirement float64
6 BLS_FEDERAL_OtherSep_Rate float64
7 BLS_FEDERAL_Quits_Rate float64
8 BLS_FEDERAL_TotalSep_Level int64
9 BLS_FEDERAL_JobOpenings_Rate float64
10 BLS_FEDERAL_OtherSep_Level int64
11 BLS_FEDERAL_Quits_Level int64
12 BLS_FEDERAL_JobOpenings_Level int64
13 BLS_FEDERAL_Layoffs_Rate float64
14 BLS_FEDERAL_Layoffs_Level int64
15 BLS_FEDERAL_TotalSep_Rate float64
16 SALARYLog float64
17 LOSLog float64
18 SEPCount_EFDATE_OCCLog float64
19 SEPCount_EFDATE_LOCLog float64
20 IndAvgSalaryLog float64
21 AGELVL_B uint8
22 AGELVL_C uint8
23 AGELVL_D uint8
24 AGELVL_E uint8
25 AGELVL_F uint8
26 AGELVL_G uint8
27 AGELVL_H uint8
28 AGELVL_I uint8
29 AGELVL_J uint8
30 AGELVL_K uint8
31 LOC_01 uint8
32 LOC_02 uint8
33 LOC_04 uint8
34 LOC_05 uint8
35 LOC_06 uint8
36 LOC_08 uint8
37 LOC_09 uint8
38 LOC_10 uint8
39 LOC_11 uint8
40 LOC_12 uint8
41 LOC_13 uint8
42 LOC_15 uint8
43 LOC_16 uint8
44 LOC_17 uint8
45 LOC_18 uint8
46 LOC_19 uint8
47 LOC_20 uint8
48 LOC_21 uint8
49 LOC_22 uint8
50 LOC_23 uint8
51 LOC_24 uint8
52 LOC_25 uint8
53 LOC_26 uint8
54 LOC_27 uint8
55 LOC_28 uint8
56 LOC_29 uint8
57 LOC_30 uint8
58 LOC_31 uint8
59 LOC_32 uint8
60 LOC_33 uint8
61 LOC_34 uint8
62 LOC_35 uint8
63 LOC_36 uint8
64 LOC_37 uint8
65 LOC_38 uint8
66 LOC_39 uint8
67 LOC_40 uint8
68 LOC_41 uint8
69 LOC_42 uint8
70 LOC_44 uint8
71 LOC_45 uint8
72 LOC_46 uint8
73 LOC_47 uint8
74 LOC_48 uint8
75 LOC_49 uint8
76 LOC_50 uint8
77 LOC_51 uint8
78 LOC_53 uint8
79 LOC_54 uint8
80 LOC_55 uint8
81 LOC_56 uint8
82 TOA_10 uint8
83 TOA_15 uint8
84 TOA_20 uint8
85 TOA_30 uint8
86 TOA_32 uint8
87 TOA_35 uint8
88 TOA_38 uint8
89 TOA_40 uint8
90 TOA_42 uint8
91 TOA_44 uint8
92 TOA_45 uint8
93 TOA_48 uint8
94 LOCTYP_1 uint8
95 PPTYP_1 uint8
96 PPGROUP_11 uint8
97 PPGROUP_12 uint8
98 TOATYP_1 uint8
99 TOATYP_2 uint8
CPU times: user 25.4 ms, sys: 0 ns, total: 25.4 ms
Wall time: 24.3 ms

Dimensionality Reduction using Principal Component Analysis

We also scale the data values to remove bias in our models due to different attribute scales. Without scaling the data, attributes such as SALARY and LOS would carry heavier weights when compared against the binary encoded attributes and BLS data. This would cause unbalanced and improperly analyzed data for model creation.

In [77]:
OPMScaledAnalysisData = OPMAnalysisDataNoFam.copy()
del OPMScaledAnalysisData["SEP"]
In [78]:
%%time

OPMAnalysisScalerFit = MinMaxScaler().fit(OPMScaledAnalysisData)
## Pickle for later re-use if needed
pickleObject(OPMAnalysisScalerFit, "OPMAnalysisScalerFit")

OPMScaledAnalysisData = pd.DataFrame(OPMAnalysisScalerFit.transform(OPMScaledAnalysisData), columns = OPMScaledAnalysisData.columns)
CPU times: user 15.3 ms, sys: 2.55 ms, total: 17.9 ms
Wall time: 19.4 ms
In [79]:
display(OPMScaledAnalysisData.head())
GSEGRD IndAvgSalary SalaryOverUnderIndAvg LowerLimitAge YearsToRetirement BLS_FEDERAL_OtherSep_Rate BLS_FEDERAL_Quits_Rate BLS_FEDERAL_TotalSep_Level BLS_FEDERAL_JobOpenings_Rate BLS_FEDERAL_OtherSep_Level BLS_FEDERAL_Quits_Level BLS_FEDERAL_JobOpenings_Level BLS_FEDERAL_Layoffs_Rate BLS_FEDERAL_Layoffs_Level BLS_FEDERAL_TotalSep_Rate SALARYLog LOSLog SEPCount_EFDATE_OCCLog SEPCount_EFDATE_LOCLog IndAvgSalaryLog AGELVL_B AGELVL_C AGELVL_D AGELVL_E AGELVL_F AGELVL_G AGELVL_H AGELVL_I AGELVL_J AGELVL_K LOC_01 LOC_02 LOC_04 LOC_05 LOC_06 LOC_08 LOC_09 LOC_10 LOC_11 LOC_12 LOC_13 LOC_15 LOC_16 LOC_17 LOC_18 LOC_19 LOC_20 LOC_21 LOC_22 LOC_23 LOC_24 LOC_25 LOC_26 LOC_27 LOC_28 LOC_29 LOC_30 LOC_31 LOC_32 LOC_33 LOC_34 LOC_35 LOC_36 LOC_37 LOC_38 LOC_39 LOC_40 LOC_41 LOC_42 LOC_44 LOC_45 LOC_46 LOC_47 LOC_48 LOC_49 LOC_50 LOC_51 LOC_53 LOC_54 LOC_55 LOC_56 TOA_10 TOA_15 TOA_20 TOA_30 TOA_32 TOA_35 TOA_38 TOA_40 TOA_42 TOA_44 TOA_45 TOA_48 LOCTYP_1 PPTYP_1 PPGROUP_11 PPGROUP_12 TOATYP_1 TOATYP_2
0 0.50 0.141962 0.473742 0.0 1.0 0.666667 0.333333 0.117647 0.230769 0.444444 0.125 0.194444 0.0 0.0 0.083333 0.241563 0.784939 0.596123 0.452208 0.291827 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
1 0.00 0.013540 0.477842 0.0 1.0 0.666667 0.333333 0.117647 0.230769 0.444444 0.125 0.194444 0.0 0.0 0.083333 0.039837 0.706870 0.167409 0.824554 0.035089 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
2 0.50 0.209945 0.472574 0.0 1.0 0.666667 0.333333 0.117647 0.230769 0.444444 0.125 0.194444 0.0 0.0 0.083333 0.322776 0.782243 0.496476 0.801116 0.392408 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
3 0.25 0.073454 0.453709 0.0 1.0 0.666667 0.333333 0.117647 0.230769 0.444444 0.125 0.194444 0.0 0.0 0.083333 0.095041 0.807008 0.334818 0.416300 0.169028 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
4 0.00 0.052941 0.498728 0.0 1.0 0.666667 0.333333 0.117647 0.230769 0.444444 0.125 0.194444 0.0 0.0 0.083333 0.159258 0.831444 0.477794 0.644868 0.126562 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

PCA Principal Components defined

Our objective, is to reduce dimensionality through identification of principal components. We have chosen 100 as the maximum number of components to be produced, given our hopes are to reduce the number of attributes needed for a model. We will review each component's explained variance further to determine the proper number of components to be included later during model generation. Note randomized PCA was chosen in order to use singular value decomposition in our dimensionality reduction efforts due to the large size of our data set.

In [80]:
%%time

seed = len(OPMScaledAnalysisData)

print(OPMScaledAnalysisData.shape)
pca_class = PCA(n_components=len(OPMScaledAnalysisData.columns), svd_solver='randomized', random_state=seed)

pca_class.fit(OPMScaledAnalysisData)
(16638, 99)
CPU times: user 3.74 s, sys: 16.2 s, total: 19.9 s
Wall time: 471 ms

Below, the resulting components have been ordered by eigenvector value and these values portrayed as ratios of variance explained by each component. In order to identify the principal components to be included during model generation, we review the rate at which explained variance decreases in significance from one principal component to the next. Accompanying these proportion values is a scree plot representing these same values in visual form. By plotting the scree plot, it is easier to judge where this rate of decreasing explained variance occurs. Note the rate of change in explained variance among the first 8 principal components, with another less significant change through the 22th component. After the 22th component, the rate of decreasing explained variance begins to somewhat flatten out.

In [81]:
%%time

#The amount of variance that each PC explains
var= pca_class.explained_variance_ratio_

sns.set(font_scale=1)
plt.plot(range(1,len(OPMScaledAnalysisData.columns)+1), var*100, marker = '.', color = 'red', markerfacecolor = 'black')
plt.xlabel('Principal Components')
plt.ylabel('Percentage of Explained Variance')
plt.title('Scree Plot')
plt.axis([0, len(OPMScaledAnalysisData.columns)+1, -0.1, 9])

np.set_printoptions(suppress=True)
print(np.round(var, decimals=4)*100)
[ 11.33   9.61   6.07   5.74   4.4    3.84   3.38   3.24   3.18   2.92
   2.84   2.75   2.69   2.64   2.58   2.53   2.28   1.99   1.88   1.75
   1.75   1.24   1.15   1.     0.97   0.94   0.83   0.81   0.75   0.72
   0.64   0.62   0.59   0.52   0.49   0.48   0.46   0.43   0.42   0.39
   0.35   0.35   0.32   0.32   0.31   0.3    0.28   0.27   0.27   0.26
   0.25   0.24   0.22   0.22   0.21   0.2    0.2    0.2    0.19   0.18
   0.18   0.17   0.17   0.14   0.14   0.13   0.11   0.1    0.09   0.09
   0.09   0.08   0.07   0.07   0.06   0.06   0.05   0.05   0.04   0.04
   0.02   0.02   0.01   0.01   0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     0.  ]
CPU times: user 372 ms, sys: 2.12 s, total: 2.5 s
Wall time: 59.6 ms
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

By now referring to the cumulative variance values and associated plot below, it may be seen that the cumulative variance increases in a fairly consistent parabola curve. In attempts to acheive a cumulative variance explained of greater than 80%, we end at 22 principal components. For this reason, 22 principal components may be selected as being the most appropriate for separation classification modeling given the variables among these data.

In [82]:
#Cumulative Variance explains
var1=np.cumsum(np.round(pca_class.explained_variance_ratio_, decimals=4)*100)

plt.plot(range(1,len(OPMScaledAnalysisData.columns)+1), var1, marker = '.', color = 'green', markerfacecolor = 'black')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance (Sum %)')
plt.title('Cumulative Variance Plot')
plt.axis([0, len(OPMScaledAnalysisData.columns)+1, 10, len(OPMScaledAnalysisData.columns)+1])

print(var1)
[ 11.33  20.94  27.01  32.75  37.15  40.99  44.37  47.61  50.79  53.71
  56.55  59.3   61.99  64.63  67.21  69.74  72.02  74.01  75.89  77.64
  79.39  80.63  81.78  82.78  83.75  84.69  85.52  86.33  87.08  87.8
  88.44  89.06  89.65  90.17  90.66  91.14  91.6   92.03  92.45  92.84
  93.19  93.54  93.86  94.18  94.49  94.79  95.07  95.34  95.61  95.87
  96.12  96.36  96.58  96.8   97.01  97.21  97.41  97.61  97.8   97.98
  98.16  98.33  98.5   98.64  98.78  98.91  99.02  99.12  99.21  99.3
  99.39  99.47  99.54  99.61  99.67  99.73  99.78  99.83  99.87  99.91
  99.93  99.95  99.96  99.97  99.97  99.97  99.97  99.97  99.97  99.97
  99.97  99.97  99.97  99.97  99.97  99.97  99.97  99.97  99.97]
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

We proceed to analyze the first 4 component Feature Loadings more carefully. See below, plots of the top 10 loadings for each component.

In [83]:
plt.rcParams['figure.figsize'] = (20, 12)
fig = plt.figure()

for i in range(0,4):
    components = pd.Series(pca_class.components_[i], index=OPMScaledAnalysisData.columns)

    maxcomponent = pd.Series(pd.DataFrame(abs(components).sort_values(ascending=False).head(10)).index)

    matplotlib.rc('xtick', labelsize=8)


    ax = fig.add_subplot(2,2,i + 1)
       
    weightsplot = pd.Series(components, index=maxcomponent)
    weightsplot.plot(title = "Principal Component "+ str(i+1), kind='bar', color = 'Tomato', ax = ax)

plt.tight_layout()
plt.show()
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [84]:
MaxPC = 22

PCList = []
for i in range(0,MaxPC):
    components = pd.Series(pca_class.components_[i], index=OPMScaledAnalysisData.columns)

    maxcomponent = pd.Series(pd.DataFrame(abs(components).sort_values(ascending=False).head(15)).index)

    PCList.append(maxcomponent)

PCList = pd.concat(PCList).drop_duplicates().sort_values(ascending=True).reset_index(drop = True)
print(PCList)
PCList = list(PCList)
0                          AGELVL_C
1                          AGELVL_D
2                          AGELVL_E
3                          AGELVL_F
4                          AGELVL_G
5                          AGELVL_H
6                          AGELVL_I
7                          AGELVL_J
8                          AGELVL_K
9     BLS_FEDERAL_JobOpenings_Level
10     BLS_FEDERAL_JobOpenings_Rate
11        BLS_FEDERAL_Layoffs_Level
12         BLS_FEDERAL_Layoffs_Rate
13       BLS_FEDERAL_OtherSep_Level
14        BLS_FEDERAL_OtherSep_Rate
15          BLS_FEDERAL_Quits_Level
16           BLS_FEDERAL_Quits_Rate
17       BLS_FEDERAL_TotalSep_Level
18        BLS_FEDERAL_TotalSep_Rate
19                           GSEGRD
20                     IndAvgSalary
21                  IndAvgSalaryLog
22                           LOC_06
23                           LOC_08
24                           LOC_11
25                           LOC_12
26                           LOC_13
27                           LOC_17
28                           LOC_24
29                           LOC_35
30                           LOC_39
31                           LOC_48
32                           LOC_51
33                    LowerLimitAge
34                       PPGROUP_11
35                       PPGROUP_12
36                        SALARYLog
37           SEPCount_EFDATE_LOCLog
38           SEPCount_EFDATE_OCCLog
39                         TOATYP_1
40                         TOATYP_2
41                           TOA_10
42                           TOA_15
43                           TOA_20
44                           TOA_30
45                           TOA_38
46                           TOA_40
47                           TOA_48
48                YearsToRetirement
dtype: object

Total of 48 features of the original 98 are identified, by taking the top 15 feature loadings within the first 22 components as determined above as the appropriate components to maximize variance explained. We may now, optionally utilize these 48 features identified, or utilize principal component vectors for analysis in the next steps.

Separation Response Weights

Due to the unproportional number of observations in each separation type in our dataset, we need to create weightings. using SciKit's class_weight algorithm, we compute an array of weights to be used downstream in our models.

In [85]:
OPMClassWeights = class_weight.compute_class_weight("balanced", OPMAnalysisDataNoFam["SEP"].drop_duplicates(), OPMAnalysisDataNoFam["SEP"])

display(stratumProf)
display(pd.DataFrame({"Weight": OPMClassWeights, "SEP": OPMAnalysisDataNoFam["SEP"].drop_duplicates()}))
SEP StratCount StratCountSample
0 NS 1259283 4000.0
1 SA 5463 4000.0
2 SC 7423 4000.0
3 SD 9476 4000.0
4 SH 15 15.0
5 SI 631 631.0
SEP Weight
0 NS 0.692730
4003 SA 0.693944
7999 SC 0.693423
11998 SD 0.694291
15992 SH 184.866667
16007 SI 4.394612

Predicting Separation

We have chosen to utilize Stratified KFold Cross Validation for our classification analysis, with 5 folds. This means, that from our original sample size of 16,638, each "fold" will save off approximately 20% as test observations utilizing the rest as training observations all while keeping the ratio of classes equal amongst customers and subscribers. This process will occur through 5 iterations, or folds, to allow us to cross validate our results amongst different test/train combinations. We have utilized a random_state seed equal to the length of the original sampled dataset to ensure reproducible results.

In [86]:
seed = len(OPMAnalysisDataNoFam)

cv = StratifiedKFold(n_splits = 5, random_state = seed)
print(OPMAnalysisDataNoFam.shape)
print(cv)
(16638, 100)
StratifiedKFold(n_splits=5, random_state=16638, shuffle=False)

Random Forest Classification

Max Depth The maximum depth (levels) in the tree. When a value is set, the tree may not split further once this level has been met regardless of how many nodes are in the leaf.

Max Features Number of features to consider when looking for a split.

Minimum Samples in Leaf Minimum number of samples required to be in a leaf node. Splits may not occur which cause the number of samples in a leaf to be less than this value. Too low a value here leads to overfitting the tree to train data.

Minimum Samples to Split Minimum number fo samples required to split a node. Care was taken during parameter tests to keep the ratio between Min Samples in Leaf and Min Samples to Split equal to that of the default values (1:2). This was done to allow an even 50/50 split on nodes which match the lowest granularity split criteria. similar to the min samples in leaf, too low a value here leads to overfitting the tree to train data.

n_estimators Number of Trees generated in the forest. Increasing the number of trees, in our models increased accuracy while decreasing performance. We tuned to provide output that completed all 10 iterations in under 10 minutes.

Not Complete#### After 13 iterations of modifying the above parameters, we land on a final winner based on the highest average Accuracy value across all iterations. Average Accuracy values in our 10 test/train iterations ranged from 70.2668 % from default inputs of the random forest classification model to a value of 72.5192 % in the best tuned model fit. Although the run-time of this model parameter choice is the largest performed, we decided to remain with these inputs due to the amount increase in accuracy. As mentioned previously, we tuned the n_estimators parameter to ensure we stayed under 10 minutes execution. Parameter inputs for the final Random Forest Classification model with the KD Tree Algorithm are as follows: ###Not Complete

max_depth max_features min_samples_leaf min_samples_split n_estimators
TBD TBD TBD TBD TBD
In [87]:
%%time

def rfc_explor(n_estimators,
               max_features,
               max_depth, 
               min_samples_split,
               min_samples_leaf,
               Data        = OPMAnalysisDataNoFam,
               cols        = PCList,
               cv          = cv,
               seed        = seed):
    startTime = datetime.now()
    y = Data["SEP"].values # get the labels we want    
    
    X = Data[cols].as_matrix()
    
    rfc_clf = RandomForestClassifier(n_estimators=n_estimators, max_features = max_features, max_depth=max_depth, min_samples_split = min_samples_split, min_samples_leaf = min_samples_leaf, class_weight = "balanced", n_jobs=-1, random_state = seed) # get object
    
    # setup pipeline to take PCA, then fit a clf model
    clf_pipe = Pipeline(
        [('minMaxScaler', MinMaxScaler()),
         ('CLF',rfc_clf)]
    )

    accuracy = cross_val_score(clf_pipe, X, y, cv=cv.split(X, y)) # this also can help with parallelism
    MeanAccuracy =  sum(accuracy)/len(accuracy)
    accuracy = np.append(accuracy, MeanAccuracy)
    endTime = datetime.now()
    TotalTime = endTime - startTime
    accuracy = np.append(accuracy, TotalTime)
    
    #print(TotalTime)
    #print(accuracy)
    
    return accuracy
CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 8.58 µs
In [88]:
%%time

def rfc_explor_w_PCA(n_estimators,
               max_features,
               max_depth, 
               min_samples_split,
               min_samples_leaf,
               PCA,
               Data        = OPMAnalysisDataNoFam,
               cv          = cv,
               seed        = seed):
    startTime = datetime.now()
    y = Data["SEP"].values # get the labels we want    
    
    X = Data.drop("SEP", axis=1).as_matrix()
    
    rfc_clf = RandomForestClassifier(n_estimators=n_estimators, max_features = max_features, max_depth=max_depth, min_samples_split = min_samples_split, min_samples_leaf = min_samples_leaf, class_weight = "balanced", n_jobs=-1, random_state = seed) # get object
    
    # setup pipeline to take PCA, then fit a clf model
    clf_pipe = Pipeline(
        [('minMaxScaler', MinMaxScaler()),
         ('PCA', PCA),
         ('CLF',rfc_clf)]
    )

    accuracy = cross_val_score(clf_pipe, X, y, cv=cv.split(X, y)) # this also can help with parallelism
    MeanAccuracy =  sum(accuracy)/len(accuracy)
    accuracy = np.append(accuracy, MeanAccuracy)
    endTime = datetime.now()
    TotalTime = endTime - startTime
    accuracy = np.append(accuracy, TotalTime)
    
    #print(TotalTime)
    #print(accuracy)
    
    return accuracy
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.63 µs
In [89]:
%%time

acclist = [] 
fullColumns = list(OPMAnalysisDataNoFam.columns)

for i in fullColumns:
    if i == "SEP": fullColumns.remove(i)

n_estimators       =  [10    , 10     , 10    , 10    , 10    , 10    , 10    , 10    , 10    , 10    , 10  , 5    , 15   ]  
max_features       =  ['auto', 'auto' , 'auto', 'auto', 'auto', 'auto', 'auto', 14    , 14    , 14    , 14  , 14   , 14   ] 
max_depth          =  [None  , None   , None  , None  , None  , None  , None  , None  , 1000  , 500   , 100 , 1000 , 1000 ] 
min_samples_split  =  [2     , 8      , 12    , 16    , 20    , 50    , 80    , 50    , 50    , 50    , 50  , 50   , 50   ] 
min_samples_leaf   =  [1     , 4      , 6     , 8     , 10    , 25    , 40    , 25    , 25    , 25    , 25  , 25   , 25   ]

##Model with all Raw Scaled Features
for i in range(0,len(n_estimators)):
    acclist.append(rfc_explor(n_estimators      = n_estimators[i],
                              max_features      = max_features[i],
                              max_depth         = max_depth[i],
                              min_samples_split = min_samples_split[i],
                              min_samples_leaf  = min_samples_leaf[i],
                              cols              = fullColumns
                             )
                  )

rfcdf = pd.DataFrame(pd.concat([pd.DataFrame({  "ModelVersion": "All Raw Features",
                                                "n_estimators": n_estimators,          
                                                "max_features": max_features,         
                                                "max_depth": max_depth,        
                                                "min_samples_split": min_samples_split,
                                                "min_samples_leaf": min_samples_leaf   
                                              }),
                               pd.DataFrame(acclist)], axis = 1).reindex())
rfcdf.columns = ['ModelVersion', 'max_depth', 'max_features', 'min_samples_leaf','min_samples_split', 'n_estimators', 'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4', 'MeanAccuracy', 'RunTime']
display(rfcdf)
del rfcdf, acclist

acclist = []

## Model with only top 15 raw Scaled Principal Features 
for i in range(0,len(n_estimators)):
    acclist.append(rfc_explor(n_estimators      = n_estimators[i],
                              max_features      = max_features[i],
                              max_depth         = max_depth[i],
                              min_samples_split = min_samples_split[i],
                              min_samples_leaf  = min_samples_leaf[i]
                             )
                  )

rfcdf = pd.DataFrame(pd.concat([pd.DataFrame({  "ModelVersion": "Top 15 Raw from PC",
                                                "n_estimators": n_estimators,          
                                                "max_features": max_features,         
                                                "max_depth": max_depth,        
                                                "min_samples_split": min_samples_split,
                                                "min_samples_leaf": min_samples_leaf   
                                              }),
                               pd.DataFrame(acclist)], axis = 1).reindex())
rfcdf.columns = ['ModelVersion', 'max_depth', 'max_features', 'min_samples_leaf','min_samples_split', 'n_estimators', 'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4', 'MeanAccuracy', 'RunTime']
display(rfcdf)
del rfcdf, acclist

### Model with PCA
acclist = []

for i in range(0,len(n_estimators)):
    acclist.append(rfc_explor_w_PCA(n_estimators      = n_estimators[i],
                                    max_features      = max_features[i],
                                    max_depth         = max_depth[i],
                                    min_samples_split = min_samples_split[i],
                                    min_samples_leaf  = min_samples_leaf[i],
                                    PCA               = PCA(n_components=22, svd_solver='randomized', random_state = seed)
                                   )
                  )

rfcdf = pd.DataFrame(pd.concat([pd.DataFrame({  "ModelVersion": "With PCA",
                                                "n_estimators": n_estimators,          
                                                "max_features": max_features,         
                                                "max_depth": max_depth,        
                                                "min_samples_split": min_samples_split,
                                                "min_samples_leaf": min_samples_leaf   
                                              }),
                               pd.DataFrame(acclist)], axis = 1).reindex())
rfcdf.columns = ['ModelVersion', 'max_depth', 'max_features', 'min_samples_leaf','min_samples_split', 'n_estimators', 'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4', 'MeanAccuracy', 'RunTime']
display(rfcdf)

#'Iteration 5', 'Iteration 6', 'Iteration 7', 'Iteration 8', 'Iteration 9', 
ModelVersion max_depth max_features min_samples_leaf min_samples_split n_estimators Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration 4 MeanAccuracy RunTime
0 All Raw Features NaN auto 1 2 10 0.283784 0.408353 0.361478 0.424406 0.438797 0.383364 00:00:01.589324
1 All Raw Features NaN auto 4 8 10 0.262763 0.336238 0.431791 0.418996 0.446917 0.379341 00:00:01.627333
2 All Raw Features NaN auto 6 12 10 0.300901 0.376803 0.404447 0.438233 0.495038 0.403084 00:00:01.627815
3 All Raw Features NaN auto 8 16 10 0.258859 0.404748 0.450721 0.436429 0.446316 0.399415 00:00:01.622389
4 All Raw Features NaN auto 10 20 10 0.253153 0.402945 0.424880 0.417193 0.482105 0.396055 00:00:01.625296
5 All Raw Features NaN auto 25 50 10 0.188589 0.396034 0.449820 0.426510 0.476090 0.387409 00:00:01.622874
6 All Raw Features NaN auto 40 80 10 0.200601 0.399639 0.406550 0.397956 0.436090 0.368167 00:00:01.623864
7 All Raw Features NaN 14 25 50 10 0.301802 0.451623 0.445312 0.405470 0.487218 0.418285 00:00:01.619373
8 All Raw Features 1000.0 14 25 50 10 0.301802 0.451623 0.445312 0.405470 0.487218 0.418285 00:00:01.626200
9 All Raw Features 500.0 14 25 50 10 0.301802 0.451623 0.445312 0.405470 0.487218 0.418285 00:00:01.620679
10 All Raw Features 100.0 14 25 50 10 0.301802 0.451623 0.445312 0.405470 0.487218 0.418285 00:00:01.620165
11 All Raw Features 1000.0 14 25 50 5 0.282883 0.431190 0.406851 0.421701 0.465564 0.401638 00:00:01.590052
12 All Raw Features 1000.0 14 25 50 15 0.260961 0.463642 0.454627 0.397355 0.453534 0.406024 00:00:01.655058
ModelVersion max_depth max_features min_samples_leaf min_samples_split n_estimators Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration 4 MeanAccuracy RunTime
0 Top 15 Raw from PC NaN auto 1 2 10 0.263063 0.374700 0.373498 0.394349 0.457444 0.372611 00:00:01.535244
1 Top 15 Raw from PC NaN auto 4 8 10 0.254655 0.377404 0.372296 0.413285 0.443008 0.372129 00:00:01.559690
2 Top 15 Raw from PC NaN auto 6 12 10 0.277477 0.425180 0.384916 0.407574 0.422556 0.383541 00:00:01.557620
3 Top 15 Raw from PC NaN auto 8 16 10 0.180781 0.417969 0.379808 0.408176 0.455038 0.368354 00:00:01.554577
4 Top 15 Raw from PC NaN auto 10 20 10 0.229129 0.417668 0.384916 0.388939 0.471579 0.378446 00:00:01.554795
5 Top 15 Raw from PC NaN auto 25 50 10 0.169069 0.368990 0.421875 0.363390 0.415038 0.347672 00:00:01.554476
6 Top 15 Raw from PC NaN auto 40 80 10 0.143544 0.397536 0.398738 0.407274 0.476090 0.364636 00:00:01.555187
7 Top 15 Raw from PC NaN 14 25 50 10 0.184685 0.442308 0.379808 0.400661 0.495639 0.380620 00:00:01.548168
8 Top 15 Raw from PC 1000.0 14 25 50 10 0.184685 0.442308 0.379808 0.400661 0.495639 0.380620 00:00:01.552192
9 Top 15 Raw from PC 500.0 14 25 50 10 0.184685 0.442308 0.379808 0.400661 0.495639 0.380620 00:00:01.551468
10 Top 15 Raw from PC 100.0 14 25 50 10 0.184685 0.442308 0.379808 0.400661 0.495639 0.380620 00:00:01.548065
11 Top 15 Raw from PC 1000.0 14 25 50 5 0.202402 0.445613 0.373197 0.415389 0.481805 0.383681 00:00:01.518456
12 Top 15 Raw from PC 1000.0 14 25 50 15 0.184084 0.448918 0.392428 0.409678 0.489323 0.384886 00:00:01.582961
ModelVersion max_depth max_features min_samples_leaf min_samples_split n_estimators Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration 4 MeanAccuracy RunTime
0 With PCA NaN auto 1 2 10 0.304505 0.445913 0.332933 0.357680 0.413534 0.370913 00:00:02.471499
1 With PCA NaN auto 4 8 10 0.346847 0.413462 0.379808 0.367899 0.469774 0.395558 00:00:02.257518
2 With PCA NaN auto 6 12 10 0.345646 0.420673 0.357873 0.382627 0.449624 0.391288 00:00:02.421437
3 With PCA NaN auto 8 16 10 0.350150 0.453726 0.377404 0.370905 0.455038 0.401444 00:00:02.300919
4 With PCA NaN auto 10 20 10 0.324925 0.416166 0.380108 0.389240 0.464962 0.395080 00:00:02.408613
5 With PCA NaN auto 25 50 10 0.346547 0.436599 0.378906 0.414788 0.477594 0.410887 00:00:02.275492
6 With PCA NaN auto 40 80 10 0.351351 0.405649 0.390325 0.399158 0.461955 0.401688 00:00:02.310364
7 With PCA NaN 14 25 50 10 0.365465 0.431490 0.374399 0.391043 0.452030 0.402886 00:00:03.009168
8 With PCA 1000.0 14 25 50 10 0.365465 0.431490 0.374399 0.391043 0.452030 0.402886 00:00:03.020894
9 With PCA 500.0 14 25 50 10 0.365465 0.431490 0.374399 0.391043 0.452030 0.402886 00:00:03.042085
10 With PCA 100.0 14 25 50 10 0.365465 0.431490 0.374399 0.391043 0.452030 0.402886 00:00:02.974908
11 With PCA 1000.0 14 25 50 5 0.357057 0.426983 0.380409 0.391043 0.436090 0.398316 00:00:02.759180
12 With PCA 1000.0 14 25 50 15 0.380480 0.438702 0.388522 0.388037 0.454436 0.410035 00:00:03.115035
CPU times: user 3min 9s, sys: 5min 9s, total: 8min 18s
Wall time: 1min 15s

We have created a function to be re-used for our cross-validation Accuracy Scores. Inputs of PCA components, Model CLF object, original sample data, and a CV containing our test/train splits allow us to easily produce an array of Accuracy Scores for the different permutations of models tested. A XXXXXXTBDXXXXX plot is also displayed depicting a view of the misclassification values for each iteration. Finally, a confusion matrix is displayed for the last test/train iteration for further interpretation on results.

In [90]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.rcParams['figure.figsize'] = (18, 6)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, round(cm[i, j],2),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    plt.show()
In [93]:
%%time

def compute_kfold_scores_Classification( clf,
                                         Data     = OPMAnalysisDataNoFam,
                                         cols     = PCList,
                                         cv       = cv):

    y = Data["SEP"].values # get the labels we want    
    
    y = np.where(y == 'NS', 0, 
                 np.where(y == 'SA', 1,
                          np.where(y == 'SC', 2,
                                   np.where(y == 'SD', 3,
                                            np.where(y == 'SH', 4,
                                                     5
                                                    )
                                           )
                                  )
                         )
                )
    
    X = Data[cols].as_matrix()


    # Run classifier with cross-validation and plot ROC curves

    # setup pipeline to take PCA, then fit a clf model
    clf_pipe = Pipeline(
        [('minMaxScaler', MinMaxScaler()),
         ('CLF',clf)]
    )

    colors = cycle(['cyan', 'indigo', 'seagreen', 'yellow', 'blue', 'darkorange', 'pink', 'darkred', 'dimgray', 'maroon', 'coral'])
    
    accuracy = []
    #logloss = []
    
    for (train, test), color in zip(cv.split(X, y), colors):
        clf_pipe.fit(X[train],y[train])  # train object
        y_hat = clf_pipe.predict(X[test]) # get test set preditions
        
        a = float(mt.accuracy_score(y[test],y_hat))
        #l = float(mt.log_loss(y[test], y_hat))
        
        accuracy.append(round(a,5)) 

        #logloss.append(round(l,5)) 
    
    #print("Accuracy Ratings across all iterations: {0}\n\n\
#Average Accuracy: {1}\n\n\
#Log Loss Values across all iterations: {2}\n\n\
#Average Log Loss: {3}\n".format(accuracy, round(sum(accuracy)/len(accuracy),5), logloss,round(sum(logloss)/len(logloss),5)))

    print("Accuracy Ratings across all iterations: {0}\n\n\
Average Accuracy: {1}\n".format(accuracy, round(sum(accuracy)/len(accuracy),5)))

    
    ytestnames = np.where(y[test] ==  0,'NS', 
                          np.where(y[test] ==  1,'SA',
                                   np.where(y[test] ==  2,'SC',
                                            np.where(y[test] ==  3,'SD',
                                                     np.where(y[test] ==  4,'SH',
                                                              'SI'
                                                             )
                                                    )
                                           )
                                  )
                         )
    
    yhatnames  = np.where(y_hat ==  0,'NS', 
                          np.where(y_hat ==  1,'SA',
                                   np.where(y_hat ==  2,'SC',
                                            np.where(y_hat ==  3,'SD',
                                                     np.where(y_hat ==  4,'SH',
                                                              'SI'
                                                             )
                                                    )
                                           )
                                  )
                         )
    #print(set(list(y_hat)))
    print("confusion matrix\n{0}\n".format(pd.crosstab(ytestnames, yhatnames, rownames = ['True'], colnames = ['Predicted'], margins = True)))
        
        # Plot non-normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(confusion_matrix(y[test], y_hat), 
                          classes   =["NS",  "SA",   "SC", "SD",  "SH",  "SI"], 
                          normalize =True,
                          title     ='Confusion matrix, with normalization')
    
    return clf_pipe.named_steps['CLF'], accuracy
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 17.6 µs
In [94]:
%%time

rfc_clf = RandomForestClassifier(n_estimators       = 15, 
                                 max_features       = 14, 
                                 max_depth          = 1000.0, 
                                 min_samples_split  = 50, 
                                 min_samples_leaf   = 25, 
                                 class_weight       = "balanced",
                                 n_jobs             = -1, 
                                 random_state       = seed) # get object
    
rfc_clf, rfc_acc = compute_kfold_scores_Classification(rfc_clf, cols = fullColumns)
Accuracy Ratings across all iterations: [0.26096, 0.46364, 0.45463, 0.39735, 0.45353]

Average Accuracy: 0.40602

confusion matrix
Predicted    NS  SA   SC    SD  SH   SI   All
True                                         
NS          623   0    0   177   0    0   800
SA          558  54   58    70   7   52   799
SC          569  18   74    45  14   79   799
SD           43   4    3   727   2   19   798
SH            2   0    0     1   0    0     3
SI           67   2    7    16   4   30   126
All        1862  78  142  1036  27  180  3325

Normalized confusion matrix
[[ 0.77875     0.          0.          0.22125     0.          0.        ]
 [ 0.69837297  0.06758448  0.07259074  0.08760951  0.00876095  0.06508135]
 [ 0.71214018  0.02252816  0.09261577  0.0563204   0.0175219   0.09887359]
 [ 0.05388471  0.00501253  0.0037594   0.91102757  0.00250627  0.02380952]
 [ 0.66666667  0.          0.          0.33333333  0.          0.        ]
 [ 0.53174603  0.01587302  0.05555556  0.12698413  0.03174603  0.23809524]]
/usr/local/es7/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

CPU times: user 3.81 s, sys: 926 ms, total: 4.73 s
Wall time: 2.16 s
In [ ]:
 
In [ ]: